MCF7 - SmartSeq - EDA¶

Introduction¶

Our experiment aims to analyze how gene expression patterns in cells are affected by different oxygen environments, specifically normoxia (normal oxygen levels) and hypoxia (reduced oxygen levels). Understanding the impact of oxygen availability on gene expression is crucial, as it plays a fundamental role in various biological processes, including cellular metabolism, development, and disease progression. By investigating the changes in gene expression under normoxic and hypoxic conditions, we can gain insights into the molecular mechanisms that cells employ to adapt and survive in low oxygen environments.

To achieve this, we will utilize two advanced sequencing methods: Smart-Seq and Drop-Seq. These methods enable us to capture the gene expression profiles of individual cells with high resolution, allowing us to examine the heterogeneity within cell populations and identify subtle transcriptional changes induced by oxygen levels. By applying these techniques to our two selected cell lines, HCC1806 and MCF7, we aim to investigate the specific responses of these cancer cells to changes in oxygen availability.

HCC1806 is derived from an acantholytic squamous carcinoma of the breast and is commonly used as a model of triple-negative breast cancer. This aggressive malignancy is characterized by uncontrolled growth and the ability to invade surrounding tissues. Understanding the alterations in gene expression patterns associated with hypoxia in HCC1806 cells is of great importance, as hypoxia is a common feature of the tumor microenvironment and has been linked to tumor progression, metastasis, and resistance to therapy.

On the other hand, MCF7 is a widely studied cell line that originates from human breast adenocarcinoma. Breast cancer is a complex disease with diverse subtypes and variable responses to treatment. Investigating the influence of oxygen levels on the gene expression profiles of MCF7 cells can provide valuable insights into the adaptive mechanisms of breast cancer cells under hypoxic conditions. This knowledge may contribute to the development of novel therapeutic strategies targeting hypoxia-related pathways in breast cancer.

The data provided for our analysis is structured as delimited text tables, with each column representing a single sequenced cell. The cell is identified by a specific name that includes information about its growth condition (normoxia or hypoxia). Each row in the table corresponds to a gene, identified by its unique gene symbol. This structured data format allows us to efficiently analyze and compare gene expression levels across different cells and conditions.

Following this experimental approach, we will perform exploratory data analysis (EDA), unsupervised learning, and supervised learning. Our project aims to unravel the transcriptional changes associated with normoxia and hypoxia in the HCC1806 and MCF7 cell lines.

Python libraries¶

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import random
import sys
import sklearn
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.decomposition import PCA

Exploratory data analysis¶

Meta data¶

We read the file with the meta data:

In [2]:
data_meta = pd.read_csv("/Users/ela/Documents/AI_LAB/SmartSeq/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(data_meta))
print("First column: ", data_meta.iloc[ : , 0])
Dataframe dimensions: (383, 8)
First column:  Filename
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam      MCF7
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam      MCF7
                                                            ... 
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam    MCF7
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam    MCF7
Name: Cell Line, Length: 383, dtype: object

Let's verify that there are no duplicate cell names in the data_meta dataset:

In [3]:
names = [i for i in data_meta["Cell name"]]
assert len(names) == len(set(names))
In [4]:
data_meta.head()
Out[4]:
Cell Line Lane Pos Condition Hours Cell name PreprocessingTag ProcessingComments
Filename
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A10 Hypo 72 S28 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A11 Hypo 72 S29 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A12 Hypo 72 S30 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A1 Norm 72 S1 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam MCF7 output.STAR.1 A2 Norm 72 S2 Aligned.sortedByCoord.out.bam STAR,FeatureCounts

Each row represents a cell and the columns are:

  • CELL LINE: MCF7 is a cell line derived from breast tumor tissue, i.e. a population of cells descended from a single cell source that has been cultured and expanded under laboratory conditions;
  • LANE: the identifier of the plate/lane used to run the PCR reactions for that cell's cDNA (PCR is a technique used in molecular biology to amplify a specific segment of DNA);
  • POS: the position of the cell on the plate;
  • CONDITION: Normoxia (normal oxygen level) or Hypoxia (low oxygen level);
  • HOURS: duration of the experiment in hours (72 for all cells);
  • CELL NAME: a unique identifier (all cells are different);
  • PREPROCESSING TAG: information about the preprocessing pipeline;
  • PROCESSINGCOMMENTS: comments on processing.
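Since the condition also appears in each filename (e.g. `1_A10_Hypo_S28`), the labels needed later for supervised learning could be recovered directly from the row index; a minimal sketch (the helper `condition_from_filename` is hypothetical, not part of the original notebook):

```python
import re

def condition_from_filename(filename):
    """Extract 'Hypo' or 'Norm' from a BAM filename such as
    'output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam'."""
    match = re.search(r"_(Hypo|Norm)_", filename)
    return match.group(1) if match else None

labels = [condition_from_filename(f) for f in [
    "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam",
    "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam",
]]
```

This would allow a cross-check that the parsed labels agree with the `Condition` column of the metadata.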
In [5]:
data_meta.describe(include='all')
Out[5]:
Cell Line Lane Pos Condition Hours Cell name PreprocessingTag ProcessingComments
count 383 383 383 383 383.0 383 383 383
unique 1 4 98 2 NaN 383 1 1
top MCF7 output.STAR.1 A10 Norm NaN S28 Aligned.sortedByCoord.out.bam STAR,FeatureCounts
freq 383 96 4 192 NaN 1 383 383
mean NaN NaN NaN NaN 72.0 NaN NaN NaN
std NaN NaN NaN NaN 0.0 NaN NaN NaN
min NaN NaN NaN NaN 72.0 NaN NaN NaN
25% NaN NaN NaN NaN 72.0 NaN NaN NaN
50% NaN NaN NaN NaN 72.0 NaN NaN NaN
75% NaN NaN NaN NaN 72.0 NaN NaN NaN
max NaN NaN NaN NaN 72.0 NaN NaN NaN
In [6]:
print(data_meta.isnull().sum())
for i in data_meta.isnull().sum():
    assert i == 0
Cell Line             0
Lane                  0
Pos                   0
Condition             0
Hours                 0
Cell name             0
PreprocessingTag      0
ProcessingComments    0
dtype: int64

There are no missing values.

Repeating the same steps for HCC1806 SmartSeq experiment (so the same experiment on another cell line), we obtain a similar result but with dataframe dimensions = (243, 8): we have 243 cells with no duplicates and no missing values in the table.

Importing data - MCF7 SmartSeq experiment¶

We read the file with the MCF7 SmartSeq experiment dataset:

In [7]:
data = pd.read_csv("/Users/ela/Documents/AI_LAB/SmartSeq/MCF7_SmartS_Unfiltered_Data.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(data))
print("First column: ", data.iloc[ : , 0])
Dataframe dimensions: (22934, 383)
First column:  "WASH7P"         0
"MIR6859-1"      0
"WASH9P"         1
"OR4F29"         0
"MTND1P23"       0
              ... 
"MT-TE"          4
"MT-CYB"       270
"MT-TT"          0
"MT-TP"          5
"MAFIP"          8
Name: "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam", Length: 22934, dtype: int64

We transpose the original dataframe to have the cells in the rows and the genes as features in the columns. We also remove the double quotes in the features' names to simplify the indexing.

In [8]:
def remove_double_quotes(word):
    return word.replace('"', '')
In [9]:
data = data.rename(columns={"{}".format(i):"{}".format(remove_double_quotes(i)) for i in data.columns})
data = data.T
data = data.rename(columns={"{}".format(i):"{}".format(remove_double_quotes(i)) for i in data.columns})
print("Dataframe dimensions:", np.shape(data))
Dataframe dimensions: (383, 22934)

HCC1806 SmartSeq experiment: we have a dataframe of dimensions (243, 23396).
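The same quote stripping can also be written more concisely with pandas' vectorized string methods; a sketch on a toy frame (the gene and cell names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame(
    [[0, 1], [2, 3]],
    index=['"WASH7P"', '"MIR6859-1"'],
    columns=['"cell_A"', '"cell_B"'],
)

# Strip the surrounding double quotes from both axes in one pass each.
toy.index = toy.index.str.replace('"', '')
toy.columns = toy.columns.str.replace('"', '')
```

This avoids building an explicit rename dictionary for every column.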

Data Structure and Type¶

In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 383 entries, output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam to output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
Columns: 22934 entries, WASH7P to MAFIP
dtypes: int64(22934)
memory usage: 67.0+ MB
In [11]:
data.head()
Out[11]:
WASH7P MIR6859-1 WASH9P OR4F29 MTND1P23 MTND2P28 MTCO1P12 MTCO2P12 MTATP8P1 MTATP6P1 ... MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP MAFIP
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 0 0 1 0 0 2 2 0 0 29 ... 0 0 0 505 147 4 270 0 5 8
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 0 0 0 0 ... 1 1 0 1 0 0 1 0 0 0
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 0 0 0 0 0 1 1 1 0 12 ... 0 0 0 1 0 0 76 0 0 0
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 0 0 0 7 ... 1 0 0 44 8 0 66 0 1 0
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 0 0 0 68 ... 0 0 0 237 31 3 727 0 0 0

5 rows × 22934 columns

In [12]:
print(data.dtypes)
for i in data.dtypes:
    assert i == "int64"
WASH7P       int64
MIR6859-1    int64
WASH9P       int64
OR4F29       int64
MTND1P23     int64
             ...  
MT-TE        int64
MT-CYB       int64
MT-TT        int64
MT-TP        int64
MAFIP        int64
Length: 22934, dtype: object
In [13]:
numeric_columns = data.select_dtypes(include=[np.number]).columns
all_numeric = len(numeric_columns) == len(data.columns)

print(all_numeric)
True

All data are numerical, there are NO categorical data.

In [14]:
desc_table = data.describe()
desc_table
Out[14]:
WASH7P MIR6859-1 WASH9P OR4F29 MTND1P23 MTND2P28 MTCO1P12 MTCO2P12 MTATP8P1 MTATP6P1 ... MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP MAFIP
count 383.000000 383.000000 383.000000 383.00000 383.000000 383.000000 383.000000 383.000000 383.000000 383.000000 ... 383.000000 383.000000 383.000000 383.000000 383.000000 383.000000 383.00000 383.000000 383.000000 383.000000
mean 0.133159 0.026110 1.344648 0.05483 0.049608 6.261097 4.681462 0.524804 0.073107 222.054830 ... 1.060052 0.443864 3.146214 1016.477807 204.600522 5.049608 2374.97389 2.083551 5.626632 1.749347
std 0.618664 0.249286 2.244543 0.31477 0.229143 7.565749 6.232649 0.980857 0.298131 262.616874 ... 1.990566 1.090827 4.265352 1009.444811 220.781927 6.644302 2920.39000 3.372714 7.511180 3.895204
min 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 23.000000 ... 0.000000 0.000000 0.000000 172.000000 30.500000 0.000000 216.50000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.00000 0.000000 3.000000 2.000000 0.000000 0.000000 98.000000 ... 0.000000 0.000000 1.000000 837.000000 152.000000 3.000000 785.00000 0.000000 3.000000 0.000000
75% 0.000000 0.000000 2.000000 0.00000 0.000000 10.000000 7.000000 1.000000 0.000000 370.500000 ... 1.000000 0.000000 5.000000 1549.000000 294.000000 7.000000 4059.00000 3.000000 8.000000 2.000000
max 9.000000 4.000000 20.000000 3.00000 2.000000 45.000000 36.000000 6.000000 2.000000 1662.000000 ... 15.000000 8.000000 22.000000 8115.000000 2002.000000 46.000000 16026.00000 22.000000 56.000000 32.000000

8 rows × 22934 columns

In [15]:
print("Global max is:", desc_table.loc["max"].max())
print("Global min is:", desc_table.loc["min"].min())
Global max is: 190556.0
Global min is: 0.0

At first glance, the dataset seems to have a lot of 0 entries and some big numbers (outliers). We will deal with sparsity and outliers in the next sections.

In [16]:
print(data.isnull().sum())
for i in data.isnull().sum():
    assert i == 0
WASH7P       0
MIR6859-1    0
WASH9P       0
OR4F29       0
MTND1P23     0
            ..
MT-TE        0
MT-CYB       0
MT-TT        0
MT-TP        0
MAFIP        0
Length: 22934, dtype: int64

There are no missing values.

HCC1806 SmartSeq experiment: same results but with 210944.0 as global max.

Duplicate genes¶

The duplicated() function in pandas is used to identify duplicate rows in a DataFrame or Series. To see which genes are redundant, we use data.T.duplicated(), which returns a boolean Series indicating which rows of data.T (i.e. columns of data) are duplicates of an earlier row.
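On a toy matrix the behaviour of duplicated() looks like this: the first occurrence of each expression profile is kept (False), and later identical rows are flagged (True):

```python
import pandas as pd

# Toy gene-by-cell matrix: gene_A and gene_B have identical profiles.
toy = pd.DataFrame(
    {"cell_1": [0, 0, 5], "cell_2": [1, 1, 2]},
    index=["gene_A", "gene_B", "gene_C"],
)

dup_mask = toy.duplicated()   # rows here play the role of data.T's rows (genes)
n_dup = int(dup_mask.sum())   # number of redundant genes
```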

In [17]:
duplicate_data = data.T[data.T.duplicated()]
print("Number of duplicate genes:", duplicate_data.shape[0], "over", data.T.shape[0])
print("Percentage of duplicate genes:", (duplicate_data.shape[0] * 100) / (data.T.shape[0]), "%")
Number of duplicate genes: 29 over 22934
Percentage of duplicate genes: 0.12644981250545043 %

Since we have duplicate genes, we need to understand which ones are equal to each other. To do so, we use a correlation matrix of duplicate genes.

In [18]:
duplicate_rows_df_t = duplicate_data.T
duplicate_rows_df_t
c_dupl = duplicate_rows_df_t.corr()
c_dupl
Out[18]:
KLF2P3 UGT1A9 SLC22A14 COQ10BP2 LAP3P2 GALNT17 PON1 MIR664B KCNS2 MIR548D1 ... RBFOX1 ASPA BCL6B CCL3L1 OTOP3 RNA5SP450 PSG1 MIR3191 SEZ6L ADAMTS5
KLF2P3 1.000000 -0.014798 -0.008333 -0.008333 -0.032300 -0.007903 -0.007903 -0.007142 -0.008333 -0.007903 ... -0.008333 -0.007903 -0.008333 -0.012088 -0.006928 -0.008333 -0.008333 -0.008333 -0.006775 -0.007903
UGT1A9 -0.014798 1.000000 -0.009322 -0.009322 -0.008675 -0.008841 -0.008841 -0.007990 -0.009322 -0.008841 ... -0.009322 -0.008841 -0.009322 -0.013523 -0.007750 -0.009322 -0.009322 -0.009322 -0.007579 -0.008841
SLC22A14 -0.008333 -0.009322 1.000000 0.497375 -0.020348 0.948434 0.948434 -0.004499 0.497375 -0.004979 ... 0.497375 0.630630 0.497375 -0.007615 0.831379 -0.005249 0.497375 -0.005249 0.813013 0.630630
COQ10BP2 -0.008333 -0.009322 0.497375 1.000000 -0.020348 0.630630 0.630630 -0.004499 0.497375 -0.004979 ... 0.497375 0.630630 0.497375 -0.007615 0.134926 -0.005249 0.497375 -0.005249 0.112487 0.630630
LAP3P2 -0.032300 -0.008675 -0.020348 -0.020348 1.000000 -0.019299 -0.019299 -0.017440 -0.020348 -0.019299 ... -0.020348 -0.019299 -0.020348 -0.013474 -0.016917 -0.020348 -0.020348 0.118817 -0.016543 -0.019299
GALNT17 -0.007903 -0.008841 0.948434 0.630630 -0.019299 1.000000 1.000000 -0.004267 0.630630 -0.004722 ... 0.630630 0.799056 0.630630 -0.007222 0.612365 -0.004979 0.630630 -0.004979 0.586533 0.799056
PON1 -0.007903 -0.008841 0.948434 0.630630 -0.019299 1.000000 1.000000 -0.004267 0.630630 -0.004722 ... 0.630630 0.799056 0.630630 -0.007222 0.612365 -0.004979 0.630630 -0.004979 0.586533 0.799056
MIR664B -0.007142 -0.007990 -0.004499 -0.004499 -0.017440 -0.004267 -0.004267 1.000000 -0.004499 -0.004267 ... -0.004499 -0.004267 -0.004499 0.007958 -0.003741 -0.004499 -0.004499 -0.004499 -0.003658 -0.004267
KCNS2 -0.008333 -0.009322 0.497375 0.497375 -0.020348 0.630630 0.630630 -0.004499 1.000000 -0.004979 ... 0.497375 0.630630 1.000000 0.021357 0.134926 -0.005249 0.497375 -0.005249 0.112487 0.948434
MIR548D1 -0.007903 -0.008841 -0.004979 -0.004979 -0.019299 -0.004722 -0.004722 -0.004267 -0.004979 1.000000 ... -0.004979 -0.004722 -0.004979 -0.007222 -0.004139 -0.004979 -0.004979 -0.004979 -0.004048 -0.004722
STRA6LP 0.173251 0.031704 -0.022850 0.001061 0.050458 -0.021672 -0.021672 -0.015486 -0.022850 0.069042 ... -0.022850 -0.021672 -0.022850 0.034205 -0.018997 -0.022850 -0.022850 0.096707 -0.018577 -0.021672
MUC6 -0.007656 -0.008565 0.654887 0.654887 -0.018695 0.829681 0.829681 -0.004134 0.654887 -0.004574 ... 0.654887 0.829681 0.654887 -0.006996 0.178813 -0.004823 0.654887 -0.004823 0.149322 0.829681
LINC00595 -0.012177 -0.013622 -0.007671 -0.007671 -0.029734 -0.007275 -0.007275 -0.006575 -0.007671 -0.007275 ... -0.007671 -0.007275 -0.007671 -0.011127 -0.006377 -0.007671 -0.007671 -0.007671 -0.006237 -0.007275
CACYBPP1 -0.008333 -0.009322 -0.005249 -0.005249 -0.020348 -0.004979 -0.004979 -0.004499 -0.005249 -0.004979 ... -0.005249 -0.004979 -0.005249 -0.007615 -0.004364 -0.005249 -0.005249 -0.005249 -0.004268 -0.004979
KNOP1P1 -0.011158 -0.012483 -0.007029 -0.007029 -0.027247 -0.006667 -0.006667 -0.006025 -0.007029 -0.006667 ... -0.007029 -0.006667 -0.007029 0.031184 -0.005844 -0.007029 -0.007029 -0.007029 -0.005715 -0.006667
WDR95P -0.007903 -0.008841 0.948434 0.312826 -0.019299 0.799056 0.799056 -0.004267 0.312826 -0.004722 ... 0.312826 0.397167 0.312826 -0.007222 0.964653 -0.004979 0.312826 -0.004979 0.955646 0.397167
MIR19B1 -0.007903 -0.008841 -0.004979 -0.004979 0.156686 -0.004722 -0.004722 -0.004267 -0.004979 -0.004722 ... -0.004979 -0.004722 -0.004979 -0.007222 -0.004139 -0.004979 -0.004979 -0.004979 -0.004048 -0.004722
RNU6-539P -0.008333 -0.009322 -0.005249 -0.005249 -0.020348 -0.004979 -0.004979 -0.004499 -0.005249 -0.004979 ... -0.005249 -0.004979 -0.005249 -0.007615 -0.004364 -0.005249 -0.005249 -0.005249 -0.004268 -0.004979
SNURF -0.008333 -0.009322 -0.005249 -0.005249 -0.020348 -0.004979 -0.004979 -0.004499 -0.005249 -0.004979 ... -0.005249 -0.004979 -0.005249 -0.007615 -0.004364 -0.005249 -0.005249 -0.005249 -0.004268 -0.004979
RBFOX1 -0.008333 -0.009322 0.497375 0.497375 -0.020348 0.630630 0.630630 -0.004499 0.497375 -0.004979 ... 1.000000 0.630630 0.497375 -0.007615 0.134926 -0.005249 0.497375 -0.005249 0.112487 0.630630
ASPA -0.007903 -0.008841 0.630630 0.630630 -0.019299 0.799056 0.799056 -0.004267 0.630630 -0.004722 ... 0.630630 1.000000 0.630630 -0.007222 0.172005 -0.004979 0.630630 -0.004979 0.143597 0.799056
BCL6B -0.008333 -0.009322 0.497375 0.497375 -0.020348 0.630630 0.630630 -0.004499 1.000000 -0.004979 ... 0.497375 0.630630 1.000000 0.021357 0.134926 -0.005249 0.497375 -0.005249 0.112487 0.948434
CCL3L1 -0.012088 -0.013523 -0.007615 -0.007615 -0.013474 -0.007222 -0.007222 0.007958 0.021357 -0.007222 ... -0.007615 -0.007222 0.021357 1.000000 -0.006331 -0.007615 -0.007615 -0.007615 -0.006191 0.011096
OTOP3 -0.006928 -0.007750 0.831379 0.134926 -0.016917 0.612365 0.612365 -0.003741 0.134926 -0.004139 ... 0.134926 0.172005 0.134926 -0.006331 1.000000 -0.004364 0.134926 -0.004364 0.999479 0.172005
RNA5SP450 -0.008333 -0.009322 -0.005249 -0.005249 -0.020348 -0.004979 -0.004979 -0.004499 -0.005249 -0.004979 ... -0.005249 -0.004979 -0.005249 -0.007615 -0.004364 1.000000 -0.005249 -0.005249 -0.004268 -0.004979
PSG1 -0.008333 -0.009322 0.497375 0.497375 -0.020348 0.630630 0.630630 -0.004499 0.497375 -0.004979 ... 0.497375 0.630630 0.497375 -0.007615 0.134926 -0.005249 1.000000 -0.005249 0.112487 0.630630
MIR3191 -0.008333 -0.009322 -0.005249 -0.005249 0.118817 -0.004979 -0.004979 -0.004499 -0.005249 -0.004979 ... -0.005249 -0.004979 -0.005249 -0.007615 -0.004364 -0.005249 -0.005249 1.000000 -0.004268 -0.004979
SEZ6L -0.006775 -0.007579 0.813013 0.112487 -0.016543 0.586533 0.586533 -0.003658 0.112487 -0.004048 ... 0.112487 0.143597 0.112487 -0.006191 0.999479 -0.004268 0.112487 -0.004268 1.000000 0.143597
ADAMTS5 -0.007903 -0.008841 0.630630 0.630630 -0.019299 0.799056 0.799056 -0.004267 0.948434 -0.004722 ... 0.630630 0.799056 0.948434 0.011096 0.172005 -0.004979 0.630630 -0.004979 0.143597 1.000000

29 rows × 29 columns

In [19]:
data_noDup = data.T.drop_duplicates(inplace=False)
data_noDup.T
Out[19]:
WASH7P MIR6859-1 WASH9P OR4F29 MTND1P23 MTND2P28 MTCO1P12 MTCO2P12 MTATP8P1 MTATP6P1 ... MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP MAFIP
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 0 0 1 0 0 2 2 0 0 29 ... 0 0 0 505 147 4 270 0 5 8
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 0 0 0 0 ... 1 1 0 1 0 0 1 0 0 0
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 0 0 0 0 0 1 1 1 0 12 ... 0 0 0 1 0 0 76 0 0 0
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 0 0 0 7 ... 1 0 0 44 8 0 66 0 1 0
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 0 0 0 68 ... 0 0 0 237 31 3 727 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam 0 0 0 0 0 0 1 0 0 49 ... 0 0 1 341 46 1 570 0 0 0
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam 0 0 1 0 0 2 5 5 0 370 ... 0 0 2 1612 215 6 3477 3 7 6
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam 1 0 1 0 0 7 0 0 0 33 ... 0 0 0 62 20 0 349 0 2 0
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam 0 0 4 1 0 29 4 0 0 228 ... 3 0 2 1934 575 7 2184 2 28 1
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam 1 0 5 0 0 5 3 0 0 71 ... 5 2 3 1328 490 4 1149 2 11 4

383 rows × 22905 columns

In [20]:
data_noDup.T.shape
Out[20]:
(383, 22905)
In [21]:
assert ((data.shape[1] - data_noDup.T.shape[1]) == duplicate_data.shape[0])
In [22]:
data = data_noDup.T

HCC1806 SmartSeq experiment: the number of duplicate genes is 54 over 23396. Therefore, after we remove the duplicates, the shape of the dataset will be (243, 23342).

Correlation between cells¶

To study the correlation between different cells, we compute the correlation matrix and plot a heatmap in order to visualize it.

In [23]:
plt.figure(figsize=(10,8))
c= data.T.corr()                   # it computes the correlation between the columns of data.T (the cells)
midpoint = (c.values.max() - c.values.min()) /2 + c.values.min()              # calculates the average correlation value between the expression profiles of cells (find the maximum and minimum correlation values in the c matrix and computes the average of these two values)
sns.heatmap(c,cmap='coolwarm', center=0)               # correlation matrix c as input and applies the colormap 'coolwarm'. The center=0 argument sets the midpoint of the colormap at zero, so positive and negative correlations are shown with different colors
print("Number of cells included: ", np.shape(c))
print("Average correlation of expression profiles between cells: ", midpoint)
print("Min. correlation of expression profiles between cells: ", c.values.min())
Number of cells included:  (383, 383)
Average correlation of expression profiles between cells:  0.49898217617448165
Min. correlation of expression profiles between cells:  -0.002035647651036618

Looking at the previous heatmap, we can notice that some cells have very low correlation values with all the others. Let's further investigate why this happens.

We first visualize the same plot for two subsets of the cells, in order to select two cells that have low correlation values with all the others and two that show high correlation values.
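Instead of eyeballing heatmap subsets, low-correlation cells can also be ranked programmatically, e.g. by each cell's mean correlation with all the others; a sketch on a toy matrix (the cell names and data are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.poisson(5, size=200).astype(float)
toy = pd.DataFrame(
    {
        "cell_a": base + rng.normal(0, 0.5, 200),   # correlated with cell_b
        "cell_b": base + rng.normal(0, 0.5, 200),
        "cell_c": rng.poisson(5, size=200).astype(float),  # unrelated profile
    }
)

corr = toy.corr()  # columns are cells here, mirroring data.T.corr()
# Mean correlation of each cell with the others (exclude the diagonal 1.0).
mean_corr = (corr.sum() - 1.0) / (corr.shape[0] - 1)
lowest = mean_corr.idxmin()
```

Sorting `mean_corr` would give a ranked list of candidate low-quality cells.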

In [24]:
data_subset = data.iloc[:20, :]
In [25]:
plt.figure(figsize=(10,8))
c= data_subset.T.corr()                   # it computes the correlation between the columns of data.T (the cells)
midpoint = (c.values.max() - c.values.min()) /2 + c.values.min()              # calculates the average correlation value between the expression profiles of cells (find the maximum and minimum correlation values in the c matrix and computes the average of these two values)
sns.heatmap(c,cmap='coolwarm', center=0)               # correlation matrix c as input and applies the colormap 'coolwarm'. The center=0 argument sets the midpoint of the colormap at zero, so positive and negative correlations are shown with different colors
print("Number of cells included: ", np.shape(c))
Number of cells included:  (20, 20)
In [26]:
data_subset_1 = data.iloc[30:60, :]
In [27]:
plt.figure(figsize=(10,8))
c= data_subset_1.T.corr()                   # it computes the correlation between the columns of data.T (the cells)
midpoint = (c.values.max() - c.values.min()) /2 + c.values.min()              # calculates the average correlation value between the expression profiles of cells (find the maximum and minimum correlation values in the c matrix and computes the average of these two values)
sns.heatmap(c,cmap='coolwarm', center=0)               # correlation matrix c as input and applies the colormap 'coolwarm'. The center=0 argument sets the midpoint of the colormap at zero, so positive and negative correlations are shown with different colors
print("Number of cells included: ", np.shape(c))
Number of cells included:  (30, 30)
In [28]:
# Cells WITHOUT correlation
cell_1_nocorr = 'output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam'
cell_2_nocorr = 'output.STAR.1_D8_Hypo_S170_Aligned.sortedByCoord.out.bam'

# Cells WITH correlation
cell_3_corr = 'output.STAR.1_C4_Norm_S100_Aligned.sortedByCoord.out.bam'
cell_4_corr = 'output.STAR.4_B4_Norm_S70_Aligned.sortedByCoord.out.bam'

Let's try to visualize their gene expression through their violin plots:

In [29]:
sns.violinplot(x=data.loc[cell_1_nocorr])
plt.show()
In [30]:
sns.violinplot(x= data.loc[cell_2_nocorr])
plt.show()
In [31]:
sns.violinplot(x= data.loc[cell_3_corr])
plt.show()
In [32]:
sns.violinplot(x= data.loc[cell_4_corr])
plt.show()
In [33]:
row1_values = data.loc[cell_1_nocorr]
row2_values = data.loc[cell_2_nocorr]
row3_values = data.loc[cell_3_corr]
row4_values = data.loc[cell_4_corr]

# Step 4: Create a new DataFrame using the selected rows
elem = pd.DataFrame({ 'Cell 1 WITHOUT correlation': row1_values, 'Cell 2 WITHOUT correlation': row2_values, 'Cell 3 WITH correlation': row3_values, 'Cell 4 WITH correlation': row4_values})

# Step 5: Plot the violin plot

plt.figure(figsize=(16,4))
sns.violinplot(data=elem)
plt.show()

From these plots we can deduce that the cells showing almost no correlation with the others are those expressing very few genes. We will need to remove them later.

Outliers¶

Let's try to identify the number of outliers and their percentage of the total, using the standard interquartile-range (IQR) rule.

To find the outliers, we compute for each column the 25th percentile Q1 (the value below which 25% of the data falls) and the 75th percentile Q3 (the value below which 75% of the data falls), and take the interquartile range IQR = Q3 - Q1, a measure of the spread of the middle 50% of the data; values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers.

In [34]:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
WASH7P          0.0
MIR6859-1       0.0
WASH9P          2.0
OR4F29          0.0
MTND1P23        0.0
              ...  
MT-TE           7.0
MT-CYB       3842.5
MT-TT           3.0
MT-TP           8.0
MAFIP           2.0
Length: 22905, dtype: float64
In [35]:
IQR.value_counts()
Out[35]:
0.0       10616
1.0         606
2.0         354
3.0         281
4.0         251
          ...  
373.5         1
324.0         1
572.0         1
692.5         1
3842.5        1
Length: 992, dtype: int64

We can see that many genes have an interquartile range of 0.

In [36]:
data_noOut = data[~((data < (Q1 - 1.5 * IQR)) |(data > (Q3 + 1.5 * IQR))).any(axis=1)]
print(data_noOut.shape)
(4, 22905)

Considering as outliers the values lying more than 1.5 * IQR beyond the quartiles, removing every row/cell that contains at least one such value in any column leaves only 4 of the 383 cells: almost every cell has at least one extreme gene count. We thus should proceed in another way.

HCC1806 SmartSeq experiment: we would obtain a resulting dataframe of dimensions (0, 23342), therefore an empty one.

We could try to compute the IQR of each row of the dataset: we transpose the dataset and proceed as above.

In [37]:
dataT = data.T
Q1 = dataT.quantile(0.25)
Q3 = dataT.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam    17.0
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam     0.0
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam     5.0
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam       0.0
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam       7.0
                                                            ... 
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam     9.0
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam    27.0
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam    30.0
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam    38.0
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam    33.0
Length: 383, dtype: float64
In [38]:
IQR.value_counts()
Out[38]:
0.0     38
33.0    16
35.0    15
34.0    14
28.0    12
31.0    12
2.0     12
17.0    11
32.0    11
25.0    10
18.0    10
29.0    10
42.0     9
14.0     9
27.0     9
11.0     8
39.0     8
1.0      8
9.0      8
38.0     8
30.0     7
45.0     7
21.0     7
23.0     7
37.0     7
13.0     7
19.0     7
26.0     7
3.0      6
15.0     6
8.0      6
5.0      6
22.0     5
36.0     5
7.0      5
20.0     5
4.0      5
6.0      5
10.0     4
40.0     4
43.0     4
41.0     4
16.0     4
44.0     4
12.0     3
24.0     3
47.0     2
46.0     1
50.0     1
49.0     1
dtype: int64
In [39]:
data_noOut_T = dataT[~((dataT < (Q1 - 1.5 * IQR)) |(dataT > (Q3 + 1.5 * IQR))).any(axis=1)]
data_noOut = data_noOut_T.T
print(data_noOut.shape)
(383, 6424)
In [40]:
print("Difference of number of columns:", data.shape[1]-data_noOut.shape[1])
Difference of number of columns: 16481
In [41]:
print("Percentage of removed columns:", (data.shape[1]-data_noOut.shape[1])/data.shape[1]*100, "%")
Percentage of removed columns: 71.9537218947828 %

If we remove, in the transposed dataset, the rows that have outlier values in any column, i.e. the genes of the original dataset whose count in some cell is more than 1.5 times the interquartile range (IQR) below the first quartile (Q1) or above the third quartile (Q3) of that cell, we obtain a final dataframe with 6424 columns. We therefore removed a total of 16481 genes, more than 70% of the genes.

HCC1806 SmartSeq experiment: percentage of removed columns of 53.77431239825208 %, which is still very high.

It is important to notice that outliers must be treated very carefully in this case. Each observation is an RNA sequencing count, so a very high count should not be treated as an error, but rather as a potentially important signal. We will investigate in one of the following sections whether removing outliers improves our results.
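One gentler alternative to dropping rows, which keeps every cell and every gene while taming extreme counts, is to clip (winsorize) values at the IQR fence instead; a sketch on a toy gene column (not part of the original pipeline):

```python
import pandas as pd

counts = pd.Series([0, 0, 3, 5, 7, 9, 300], name="toy_gene")

q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

# Values beyond the upper fence are capped instead of removed, so high
# counts still register as "high" without dominating the scale.
clipped = counts.clip(upper=upper)
```

Whether capping is appropriate for raw counts is a modeling choice; normalization and log-transformation are the more common remedies for RNA-seq data.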

Visualizing the data¶

Let's try to gain more information about the dataset and about how to treat the outliers. We draw violin plots for a few individual, randomly chosen cells.

In [42]:
rows = list(data.index)
In [43]:
random.seed(88)
In [44]:
ind1 = random.randint(0,243)
print(ind1)
sns.violinplot(x= data.loc[rows[ind1]])
plt.show()
101
In [45]:
ind2 = random.randint(0,243)
print(ind2)
sns.violinplot(x= data.loc[rows[ind2]])
plt.show()
48

We can notice that, taking two randomly chosen cells, they have very similar plots:

  • most of the values are concentrated around zero
  • most of the values are relatively small, i.e. their order of magnitude is less than $10^3$

However, maxima are considerably different.

Now we do a violin plot with all the cells. To better visualize it, we use as xtick labels the cell name attribute of each row (saved in the meta data).

For 50 cells:

In [46]:
names = [i for i in data_meta["Cell name"]]      # get the cell name attribute of each cell

data_small = data.T.iloc[:, :50]  
names_small = names[:50]                # select 50 cells
plt.figure(figsize=(16,4))

plot=sns.violinplot(data=data_small, palette="Set3", cut=0)
plot.set_xticklabels(names_small, rotation=90, fontsize=6)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

Going back to the issue of outliers, let's plot the first 50 cells of the dataset without outliers:

In [47]:
data_noOut_small = data_noOut.T.iloc[:, :50]
names_small = names[:50]
plt.figure(figsize=(16,4))

plot=sns.violinplot(data=data_noOut_small, palette="Set3", cut=0)
plot.set_xticklabels(names_small, rotation=90, fontsize=6)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

Let's visualize the plot of the previously randomly chosen single cells excluding outliers:

In [48]:
print(ind1)
sns.violinplot(x= data_noOut.loc[rows[ind1]])
plt.show()
101
In [49]:
print(ind2)
sns.violinplot(x= data_noOut.loc[rows[ind2]])
plt.show()
48

From the previous plots we can see that removing the outliers lowers the maxima, but a large number of zeros remains.

We can deduce that the dataset is sparse. Let's analyze this concept in more detail.

HCC1806 SmartSeq experiment: similar results and same conclusion.

Sparsity of data¶

Sparsity means that the matrix contains many zero values.

We can try to quantify the sparsity of the dataset by calculating the proportion of zero values in the gene expression matrix as:

sparsity = (number of zeros) / (total number of elements in the matrix)

Let's compute this sparsity index for the original dataset:

In [50]:
n_zeros = np.count_nonzero(data==0)                # count the True elements in the boolean mask (data == 0), i.e. the number of zero entries
print('Number of 0 values in the matrix:', n_zeros)

sp = n_zeros / data.size
print('Sparsity index:', sp*100, '%')
Number of 0 values in the matrix: 5278229
Sparsity index: 60.16711094696393 %

In the dataset without outliers, we obtain:

In [51]:
n_zeros_noout = np.count_nonzero(data_noOut==0)                # count the True elements in the boolean mask (data_noOut == 0), i.e. the number of zero entries
print('Number of 0 values in the matrix:', n_zeros_noout)

sp_noout = n_zeros_noout / data_noOut.size
print('Sparsity index:', sp_noout*100, '%')
Number of 0 values in the matrix: 2349283
Sparsity index: 95.48409359159028 %

We can see that removing outliers is not a good idea here, since the sparsity becomes even higher than before.

HCC1806 SmartSeq experiment: the sparsity index of the original dataset is about 55.8 % and the one of the dataset without outliers is about 86.6 %, so the same conclusion holds.

Sparse data may cause several problems when training a machine learning model (e.g. over-fitting, lower model performance), so it should be handled properly.

Even in the original dataset, the sparsity index shows that more than half of the elements of the matrix are equal to 0, so the dataset is sparse.

Using a sparse matrix representation can be advantageous when the data is sparse: sparse matrices store only the non-zero values, which can lead to significant memory savings. Here, however, memory is not our main concern, so we can work with the dense representation.
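As a quick illustration of that memory argument, a sketch comparing dense and CSR storage on a synthetic sparse count matrix (illustrative only, not the notebook's data):

```python
import numpy as np
from scipy import sparse

# Synthetic count matrix with ~90% zeros (Poisson with small mean)
rng = np.random.default_rng(0)
dense = rng.poisson(0.1, size=(1000, 1000)).astype(np.float64)
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only non-zero values plus their column indices and row pointers
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense_bytes, csr_bytes)
```

At ~90% sparsity the CSR footprint is a small fraction of the dense one; the gap grows with the sparsity index.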

There are several ways to address the sparsity problem when training a machine learning model on gene expression data.

For instance, we could employ dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the data while preserving the most important features.

We employ PCA to address the sparsity problem in the next sections.
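As a minimal, numpy-only sketch of that PCA step (via SVD of the centered matrix; a library implementation such as scikit-learn's `PCA` could be used instead), on a random stand-in for the expression matrix:

```python
import numpy as np

# Random stand-in for the cells x genes matrix (not the notebook's data)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 500)).astype(float)

Xc = X - X.mean(axis=0)                  # center each gene (column)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 10
X_reduced = Xc @ Vt[:k].T                # project onto the top-k principal axes
explained = (S[:k] ** 2).sum() / (S ** 2).sum()  # variance retained
```

Each cell is now described by k coordinates instead of thousands of mostly-zero gene counts, which is exactly how PCA mitigates the sparsity problem downstream.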

Distribution of the data¶

To examine the distribution of the dataset, we look at the skewness and kurtosis of the gene expression profiles.

Skewness measures the degree of asymmetry in the distribution. A distribution is said to be skewed if it is not symmetric around its mean.

Kurtosis measures the degree of peakedness or flatness of the distribution, i.e. how heavy its tails are.
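As a quick sanity check of the two statistics, a sketch on synthetic samples (illustrative only, not the notebook's data): a Gaussian should give values near zero for both, while an exponential sample is clearly right-skewed.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=100_000)        # symmetric, light tails
right_skewed = rng.exponential(size=100_000)  # theoretical skewness = 2

print(skew(normal), kurtosis(normal))    # both close to 0 for a Gaussian
print(skew(right_skewed))                # clearly positive (right-skewed)
```

Note that `scipy.stats.kurtosis` returns excess kurtosis by default (0 for a normal distribution), which is the convention used in the outputs below.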

We will use the scipy.stats module to calculate the skewness and kurtosis of each column of data.T, i.e. each row (cell) of the dataset:

In [52]:
from scipy.stats import kurtosis, skew

cnames = list(data.T.columns)

colN = np.shape(data.T)[1]
data_skew_cells = []
for i in range(colN) :    
    name = data.T[cnames[i]]
    data_skew_cells += [skew(name)]
sns.histplot(data_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - original df')
plt.show()
In [53]:
print( "Skewness of data: ", data_skew_cells)
print("Mean of skewness values:", np.mean(data_skew_cells))
Skewness of data:  [65.3293476728411, 38.73257818368301, 48.14055338427522, 25.51111003985754, 61.807162316617756, ..., 47.96387486528263, 55.998763824447785]
Mean of skewness values: 55.56539160251531
In [54]:
data_kurt_cells = []
for i in range(colN) :     
    name = data.T[cnames[i]]
    data_kurt_cells += [kurtosis(name)]
sns.histplot(data_kurt_cells, bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - original df')
plt.show()
In [55]:
print( "Excess kurtosis of data distribution: ",  data_kurt_cells)
print("Mean of kurtosis values:", np.mean(data_kurt_cells))
Excess kurtosis of data distribution:  [5463.645282643022, 1995.8520052733418, 2901.798051720723, 917.7893553704595, 4656.550578232189, ..., 2870.3598178541934, 3791.121439579466]
Mean of kurtosis values: 4159.683961614613

HCC1806 SmartSeq experiment: we obtain a mean skewness of about 36.7 and a mean kurtosis of about 2390.8.

From these graphs we can deduce that the distributions are highly non-normal. Indeed, the high positive kurtosis values indicate distributions much more peaked than a normal one, and the high positive skewness values show that the distributions are right-skewed.

In general, it is acceptable to deviate from a Gaussian distribution, as not all methods require a normal distribution and this can be addressed during the analysis. Nevertheless, it would be better to reduce skewness since highly skewed data can be challenging to manage.

Data transformation¶

Data transformation is one option for dealing with these problems. A common choice for transforming highly skewed data towards a more normal distribution is to apply a log base 2 transformation (adding 1 first, so that the zero counts map to 0 rather than to an undefined log):

In [56]:
data_log2 = np.log2(data+1)
data_log2
Out[56]:
WASH7P MIR6859-1 WASH9P OR4F29 MTND1P23 MTND2P28 MTCO1P12 MTCO2P12 MTATP8P1 MTATP6P1 ... MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP MAFIP
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 0.0 0.0 1.000000 0.0 0.0 1.584963 1.584963 0.000000 0.0 4.906891 ... 0.000000 0.000000 0.000000 8.982994 7.209453 2.321928 8.082149 0.000000 2.584963 3.169925
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 ... 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 1.000000 1.000000 1.000000 0.0 3.700440 ... 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 6.266787 0.000000 0.000000 0.000000
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 3.000000 ... 1.000000 0.000000 0.000000 5.491853 3.169925 0.000000 6.066089 0.000000 1.000000 0.000000
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 6.108524 ... 0.000000 0.000000 0.000000 7.894818 5.000000 2.000000 9.507795 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 0.000000 1.000000 0.000000 0.0 5.643856 ... 0.000000 0.000000 1.000000 8.417853 5.554589 1.000000 9.157347 0.000000 0.000000 0.000000
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam 0.0 0.0 1.000000 0.0 0.0 1.584963 2.584963 2.584963 0.0 8.535275 ... 0.000000 0.000000 1.584963 10.655531 7.754888 2.807355 11.764042 2.000000 3.000000 2.807355
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam 1.0 0.0 1.000000 0.0 0.0 3.000000 0.000000 0.000000 0.0 5.087463 ... 0.000000 0.000000 0.000000 5.977280 4.392317 0.000000 8.451211 0.000000 1.584963 0.000000
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam 0.0 0.0 2.321928 1.0 0.0 4.906891 2.321928 0.000000 0.0 7.839204 ... 2.000000 0.000000 1.584963 10.918118 9.169925 3.000000 11.093418 1.584963 4.857981 1.000000
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam 1.0 0.0 2.584963 0.0 0.0 2.584963 2.000000 0.000000 0.0 6.169925 ... 2.584963 1.584963 2.000000 10.376125 8.939579 2.321928 10.167418 1.584963 3.584963 2.321928

383 rows × 22905 columns

We visualize violin plots using the same indices previously randomly selected.

In [57]:
print(ind1)
sns.violinplot(x=data_log2.loc[rows[ind1]])
101
Out[57]:
<Axes: xlabel='output.STAR.2_A3_Norm_S9_Aligned.sortedByCoord.out.bam'>
In [58]:
print(ind2)
sns.violinplot(x=data_log2.loc[rows[ind2]])
48
Out[58]:
<Axes: xlabel='output.STAR.1_E10_Hypo_S220_Aligned.sortedByCoord.out.bam'>
In [59]:
data_log2.T.describe()
Out[59]:
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam ... output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
count 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 ... 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000 22905.000000
mean 1.892372 0.009677 1.734012 0.409288 1.565756 2.177625 2.542539 2.603964 2.505422 0.628043 ... 1.661318 2.374226 0.512457 1.974534 1.746693 1.626697 2.147861 2.223999 2.371146 2.301653
std 2.744578 0.115966 3.062152 0.933189 2.159384 2.937413 3.167468 3.027512 3.108120 1.184831 ... 2.207995 2.864850 1.881161 2.657156 2.429107 2.242322 2.940746 2.999271 3.099276 2.993962
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 4.169925 0.000000 2.584963 0.000000 3.000000 4.584963 5.321928 5.169925 5.285402 1.000000 ... 3.321928 4.954196 0.000000 4.169925 3.700440 3.321928 4.807355 4.954196 5.285402 5.087463
max 15.512524 3.906891 16.324181 8.179909 13.369461 15.515977 14.850138 15.637446 15.145176 10.738092 ... 14.119671 14.511506 16.322509 14.850431 13.568669 14.235266 14.774272 15.313060 15.497540 16.069932

8 rows × 383 columns

The plot for the first 50 cells is:

In [60]:
data_small_log2 = data_log2.T.iloc[:, :50]
names_small = names[:50]
plt.figure(figsize=(16,4))

plot=sns.violinplot(data=data_small_log2, palette="Set3", cut=0)
plot.set_xticklabels(names_small, rotation=90, fontsize=6)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

Let's visualize skewness and kurtosis of the transformed data:

In [61]:
cnames = list(data_log2.T.columns)

colN = np.shape(data_log2.T)[1]
data_log_skew_cells = []
for i in range(colN) :     
    name = data_log2.T[cnames[i]]
    data_log_skew_cells += [skew(name)]
sns.histplot(data_log_skew_cells, bins=100)
plt.xlabel('Skewness of single cells expression profiles - log base 2 df')
plt.show()
In [62]:
print( "Skewness of log base 2 df: ", data_log_skew_cells)
print("Mean skewness:", np.mean(data_log_skew_cells))
Skewness of log base 2 df:  [1.1057395234854044, 15.61679195544179, 1.4670439553197712, 2.675818577529791, 1.204185389342206, ..., 0.8482511961518842, 0.8585750546288899]
Mean skewness: 2.482476642311377
In [63]:
data_kurt_cells = [kurtosis(data_log2.T[cnames[i]]) for i in range(colN)]
sns.histplot(data_kurt_cells, bins=100)
plt.xlabel('Kurtosis of single-cell expression profiles - log base 2 df')
plt.show()
In [64]:
print( "Excess kurtosis of log base 2 distribution: ",  data_kurt_cells)
print("Mean kurtosis:", np.mean(data_kurt_cells))
Excess kurtosis of log base 2 distribution:  [-0.10313617588687274, 321.7228980621053, 0.699056587846735, 8.099655025021805, ..., -0.7299227326652322, -0.6611764440126167]
Mean kurtosis: 103.02626328733729

HCC1806 SmartSeq experiment: after applying a log base 2 transformation, we obtain mean skewness = 1.940801299270791 and mean kurtosis = 59.442962646309546.

After the log transformation the dataset still does not follow a normal distribution perfectly, but the resulting skewness and kurtosis values are lower than those of the original dataset: by the reasoning above, rescaling is therefore a good idea.
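The moment-reducing effect of the log transform can be reproduced on synthetic data. This sketch draws a hypothetical heavy-tailed (log-normal) sample, standing in for one cell's expression profile, and compares skewness and kurtosis before and after log2(x + 1):

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical heavy-tailed counts, standing in for one cell's expression profile
rng = np.random.default_rng(0)
counts = rng.lognormal(mean=2.0, sigma=1.5, size=10_000)

log_counts = np.log2(counts + 1)

# The log transform compresses the right tail, so both moments shrink
print("raw:  skew=%.2f  kurt=%.2f" % (skew(counts), kurtosis(counts)))
print("log2: skew=%.2f  kurt=%.2f" % (skew(log_counts), kurtosis(log_counts)))
```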

In [65]:
data = data_log2

Cell Filtering¶

As our previous analysis shows, we have to filter out cells with low activity, i.e. a low number of gene reads.

In [66]:
row_sum = data.sum(axis=1)

# count the cells whose total expression is zero
counter = (row_sum == 0).sum()
print(counter)
0

There are no cells with zero expression across every gene, but, as we have seen before, some cells have very low read counts; these are anomalous and should be removed.

To identify the anomalous cells, we plot the total gene counts against the number of expressed genes for each cell.

In [67]:
# Step 1: Calculate total counts of genes for each cell (sum of all elements in each row of the matrix)
total_gene_counts = data.sum(axis=1)

# Step 2: Calculate number of expressed genes for each cell (count of non-zero elements in each row of the matrix)
expressed_genes = (data != 0).sum(axis=1)

# Step 3: Create a scatter plot
plt.scatter(total_gene_counts, expressed_genes)
plt.xlabel('Total Counts of Genes')
plt.ylabel('Number of Expressed Genes')
plt.title('Gene Expression Scatter Plot')
plt.axvline(x=30000, color='salmon', linestyle='--')
plt.axvline(x=63000, color='salmon', linestyle='--')
plt.axhline(y=5000, color='salmon', linestyle='--')
plt.show()

From this plot, we can define as 'outlier' cells the ones with:

  • low gene expression and low total gene counts: their activity is very low;
  • very high gene expression and very high total gene counts: their activity is anomalously high.

We chose these bounds heuristically, simply by looking at the plot, so they may not be completely accurate.
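As a more data-driven alternative to eyeballed thresholds, one could derive bounds from the median absolute deviation (MAD). This is only a sketch with hypothetical toy totals, not the filtering actually applied here:

```python
import numpy as np
import pandas as pd

def mad_bounds(values, k=3.0):
    """Flag values more than k median-absolute-deviations away from the median."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return med - k * mad, med + k * mad

# Toy per-cell totals standing in for total_gene_counts
totals = pd.Series([45000, 47000, 50000, 52000, 48000, 1200, 250000])
lo, hi = mad_bounds(totals)
keep = (totals >= lo) & (totals <= hi)
print(keep.values)  # the two extreme cells fall outside the bounds
```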

To remove the cells we have identified, we create a copy of data so that we do not modify the original dataframe while working on it.

In [68]:
data_copy = data.copy()
In [69]:
original_indices = data_copy.index


# Calculate total counts of genes for each cell
total_gene_counts = data_copy.sum(axis=1)

# Calculate number of expressed genes for each cell
expressed_genes = (data_copy != 0).sum(axis=1)

total_counts_range = (30000, 63000)
expressed_genes_range = (5000, 14000)
data_copy['total_counts'] = total_gene_counts
data_copy['expressed_genes'] = expressed_genes 
# Boolean mask selecting only the cells within both ranges (between is inclusive)
mask = (
    data_copy['total_counts'].between(*total_counts_range) &
    data_copy['expressed_genes'].between(*expressed_genes_range)
)

data_filtered = data_copy.loc[mask].reset_index(drop=True)
original_indices = original_indices[mask]

data_filtered = data_filtered.drop(columns=['total_counts', 'expressed_genes'])  # remove the helper columns
data_filtered.index = original_indices

We visualize the plot again and run a few checks to verify that we have correctly removed the cells we defined as 'outliers'.

In [70]:
# Step 1: Calculate total counts of genes for each cell
total_gene_counts = data_filtered.sum(axis=1)

# Step 2: Calculate number of expressed genes for each cell
expressed_genes = (data_filtered != 0).sum(axis=1)

# Step 3: Create a scatter plot
plt.scatter(total_gene_counts, expressed_genes)
plt.xlabel('Total Counts of Genes')
plt.ylabel('Number of Expressed Genes')
plt.title('Gene Expression Scatter Plot')
plt.axvline(x=30000, color='salmon', linestyle='--')
plt.axvline(x=63000, color='salmon', linestyle='--')
plt.axhline(y=5000, color='salmon', linestyle='--')
plt.show()
In [71]:
total_gene_counts_filtered = data_filtered.sum(axis=1)

expressed_genes_filtered = (data_filtered != 0).sum(axis=1)


for x in total_gene_counts_filtered:
    assert x >= 30000
    assert x <= 63000

for x in expressed_genes_filtered:
    assert x >= 5000
    assert x <= 14000
# all values fall within the chosen ranges, so the outliers have been removed
In [72]:
data_filtered.head()
Out[72]:
WASH7P MIR6859-1 WASH9P OR4F29 MTND1P23 MTND2P28 MTCO1P12 MTCO2P12 MTATP8P1 MTATP6P1 ... MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP MAFIP
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 0.0 0.0 1.000000 0.0 0.0 1.584963 1.584963 0.0 0.0 4.906891 ... 0.0 0.0 0.0 8.982994 7.209453 2.321928 8.082149 0.0 2.584963 3.169925
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 1.000000 1.000000 1.0 0.0 3.700440 ... 0.0 0.0 0.0 1.000000 0.000000 0.000000 6.266787 0.0 0.000000 0.000000
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 0.0 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 6.108524 ... 0.0 0.0 0.0 7.894818 5.000000 2.000000 9.507795 0.0 0.000000 0.000000
output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam 0.0 0.0 1.000000 0.0 0.0 1.000000 2.000000 0.0 0.0 7.672425 ... 1.0 0.0 0.0 9.805744 6.832890 2.000000 11.408330 1.0 1.000000 0.000000
output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam 0.0 0.0 3.459432 0.0 0.0 2.000000 3.459432 1.0 0.0 9.434628 ... 1.0 0.0 2.0 10.321928 7.044394 0.000000 13.187197 1.0 1.000000 0.000000

5 rows × 22905 columns

In [73]:
data = data_filtered

HCC1806 experiment: we follow the same procedure, using the ranges total_counts_range = (41000, 80000) and expressed_genes_range = (7100, 13300).

Feature scaling and normalization¶

We should note that each individual cell was sequenced independently, which means the data may require normalization across cells. Normalization is the process of transforming a dataset to a common scale. This transformation does not always yield a Gaussian distribution, but, as explained before, this is acceptable.
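As an illustration of what "a common scale" can mean across independently sequenced cells, the sketch below applies a simple library-size normalization (a hypothetical counts-per-fixed-total scheme, not necessarily how the filtered-normalized file used later was produced): each cell is divided by its total count and rescaled so that all cells sum to the same value.

```python
import numpy as np
import pandas as pd

# Toy counts matrix: rows = cells, columns = genes
counts = pd.DataFrame(
    [[10, 0, 30], [100, 50, 350]],
    index=["cell_a", "cell_b"], columns=["g1", "g2", "g3"]
)

# Divide each cell (row) by its library size, then multiply by a common
# factor so every cell sums to the same total
scale = 1e4  # hypothetical target library size
norm = counts.div(counts.sum(axis=1), axis=0) * scale

print(norm.sum(axis=1))  # every cell now sums to 10000
```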

Let's plot the gene expression distributions of some selected cells from our dataset.

In [74]:
data_small = data.T.iloc[:, :20]  # select only the first 20 cells to keep the runtime short
sns.displot(data=data_small,palette="Set3",kind="kde", bw_adjust=2)
Out[74]:
<seaborn.axisgrid.FacetGrid at 0x134e2f190>

We can see that the distribution of each cell shows two peaks: this is expected since they represent genes of low and high abundance.

If we visualize the distribution of a single cell, we can clearly see this behaviour.

In [75]:
data_small_cell = data.loc['output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam'] 
sns.displot(data_small_cell, kind="kde", bw_adjust=2)
plt.show()

To compare against our dataset (which we filtered in the previous steps), we load a filtered and normalized dataset from the same experiment.

In [76]:
norm_df = pd.read_csv("/Users/ela/Documents/AI_LAB/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
norm_df = norm_df.T
print("Dataframe dimensions:", np.shape(norm_df))
Dataframe dimensions: (250, 3000)

Since we took a log transformation of our dataset, let's apply the same to the normalized one so that the plots are on a similar scale.

In [77]:
norm_df = np.log2(norm_df+1)
In [78]:
norm_df_small = norm_df.T.iloc[:, :20]  # select only the first 20 cells to keep the runtime short
sns.displot(data=norm_df_small,palette="Set3",kind="kde", bw_adjust=2)
plt.show()

Again, let's visualize the distribution of a single cell from this dataset:

In [79]:
norm_small_cell1 = norm_df.loc['"output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam"'] 
sns.displot(norm_small_cell1, kind="kde", bw_adjust=2)
plt.show()

The plots of the normalized data already look quite similar to those of our dataset; let's apply a normalization technique and see how they change. We choose StandardScaler: it is a standard, widely used approach whose output is easily interpretable by a biologist.

Using StandardScaler, we subtract the mean and divide by the standard deviation for every row.
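As a sanity check on what StandardScaler computes, this small sketch (a toy matrix standing in for the expression data) confirms that fit_transform matches subtracting each column mean and dividing by the population (ddof=0) standard deviation. StandardScaler standardizes the columns of its input, which is why the transpose is taken before fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: StandardScaler standardizes each COLUMN of its input
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaled = StandardScaler().fit_transform(X)

# Manual equivalent: subtract the column mean, divide by the population
# (ddof=0) standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))  # True
```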

In [80]:
from sklearn.preprocessing import StandardScaler
import pandas as pd


# Initialize the StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform it
data_standardized = scaler.fit_transform(data.T)

data_standardized = pd.DataFrame(data_standardized.T, columns=data.columns, index=data.index)
In [81]:
data_standardized.head()
Out[81]:
WASH7P MIR6859-1 WASH9P OR4F29 MTND1P23 MTND2P28 MTCO1P12 MTCO2P12 MTATP8P1 MTATP6P1 ... MT-TH MT-TS2 MT-TL2 MT-ND5 MT-ND6 MT-TE MT-CYB MT-TT MT-TP MAFIP
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam -0.689509 -0.689509 -0.325147 -0.689509 -0.689509 -0.112008 -0.112008 -0.689509 -0.689509 1.098378 ... -0.689509 -0.689509 -0.689509 2.583558 1.937346 0.156514 2.255324 -0.689509 0.252354 0.465493
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam -0.566285 -0.566285 -0.566285 -0.566285 -0.566285 -0.239710 -0.239710 -0.239710 -0.566285 0.642186 ... -0.566285 -0.566285 -0.566285 -0.239710 -0.566285 -0.566285 1.480290 -0.566285 -0.566285 -0.566285
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam -0.725110 -0.725110 -0.725110 -0.725110 -0.725110 -0.725110 -0.725110 -0.725110 -0.725110 2.103780 ... -0.725110 -0.725110 -0.725110 2.931022 1.590416 0.201101 3.678000 -0.725110 -0.725110 -0.725110
output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam -0.741357 -0.741357 -0.400914 -0.741357 -0.741357 -0.400914 -0.060471 -0.741357 -0.741357 1.870667 ... -0.400914 -0.741357 -0.741357 2.596941 1.584853 -0.060471 3.142530 -0.400914 -0.400914 -0.741357
output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam -0.802721 -0.802721 0.289478 -0.802721 -0.802721 -0.171288 0.289478 -0.487005 -0.802721 2.175946 ... -0.487005 -0.802721 -0.171288 2.456082 1.421310 -0.802721 3.360694 -0.487005 -0.487005 -0.802721

5 rows × 22905 columns

In [82]:
data_stand_df_small = data_standardized.T.iloc[:, :20]  # select only the first 20 cells to keep the runtime short
sns.displot(data_stand_df_small,palette="Set3",kind="kde", bw_adjust=2)
plt.show()
In [83]:
data_stand_cell= data_standardized.loc['output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam'] 
sns.displot(data_stand_cell, kind="kde", bw_adjust=2)
plt.show()

Let's compute the skewness and kurtosis of the standardized dataset:

In [84]:
print( "Skewness: ",  skew(data_standardized))
print("Mean skewness:", np.mean(skew(data_standardized)))
print()
print( "Kurtosis: ",  kurtosis(data_standardized))
print("Mean kurtosis:", np.mean(kurtosis(data_standardized)))
Skewness:  [ 3.17808611  3.01766603  0.5686532  ...  0.56069821 -0.05658675
  1.06748017]
Mean skewness: 1.320638567645328

Kurtosis:  [15.5641668  19.7366884  -0.52045079 ... -0.57930253 -0.75553775
  0.15421078]
Mean kurtosis: 10.02160514345388

The resulting values are quite low: the distribution is still non-normal, but we have greatly reduced the skewness compared both to the original dataset and to the values obtained with the log transformation alone. This is a good result, since high skewness values may cause problems, as already pointed out.

In conclusion, standardization seems to be a good way to scale our parameters, so we decide to apply it.

In [85]:
data = data_standardized

HCC1806 experiment: the same conclusion applies, since we find a mean skewness of 1.2134374660939942 and a mean kurtosis of 10.25664719668806.

In [86]:
data.shape
Out[86]:
(316, 22905)

Feature selection¶

Another important part of our analysis is the selection of genes connected to the hypoxia and normoxia conditions. We can select them using the concepts of entropy and information gain: the most important genes are those with the highest information gain.

Information gain is a measure used to quantify the usefulness of a feature (in this case, a gene) in predicting the target variable ('Hypoxia' or 'Normoxia').
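For intuition, the discrete definition of information gain (entropy of the labels minus the weighted entropy after splitting on a feature) can be written out directly. This is only a pedagogical sketch with a hypothetical binarized gene; mutual_info_classif, used in the next cell, estimates mutual information for continuous features with a nearest-neighbour estimator rather than this exact formula.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_is_high, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    on a binarized (hypothetical) gene-expression feature."""
    h_before = entropy(labels)
    h_after = 0.0
    for side in (True, False):
        subset = labels[feature_is_high == side]
        if len(subset):
            h_after += len(subset) / len(labels) * entropy(subset)
    return h_before - h_after

# Toy example: a gene high in every Hypo cell and low in every Norm cell
labels = np.array(["Hypo", "Hypo", "Hypo", "Norm", "Norm", "Norm"])
gene_high = np.array([True, True, True, False, False, False])
print(information_gain(gene_high, labels))  # 1.0: the split is perfectly informative
```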

In [87]:
merge = data.merge(data_meta, left_index=True, right_index=True, how="inner")
data_lab = merge.drop(["Cell Line", "Lane", "Pos", "Hours", "PreprocessingTag", "ProcessingComments", "Cell name"], axis=1)
data_lab["Condition"]
Out[87]:
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam    Hypo
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam    Hypo
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam      Norm
output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam      Norm
output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam      Norm
                                                            ... 
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam    Norm
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam    Norm
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam    Hypo
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam    Hypo
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam    Hypo
Name: Condition, Length: 316, dtype: object
In [88]:
data_genes = data.T
n = len(data.index)

We now calculate the information gain for each gene, using the corresponding target variable data_lab["Condition"] (i.e. the label 'Hypo' or 'Norm'). Approximately the first 3000 genes of this sorted list (3414) are the most useful for our prediction; we therefore keep those with information gain higher than 0.215 and visualize this in a plot:

In [89]:
from sklearn.feature_selection import mutual_info_classif

# Calculate the information gain for each gene
information_gain = mutual_info_classif(data, data_lab["Condition"])

# Sort the genes based on their information gain (descending order)
sorted_genes = np.argsort(information_gain)[::-1]
sorted_genes = sorted_genes[:3415]
# Print the selected genes
for gene_index in sorted_genes:
    print(f"{data.columns[gene_index]}: Information Gain = {information_gain[gene_index]}")
NDRG1: Information Gain = 0.652850010929818
BNIP3: Information Gain = 0.6293716987199368
HK2: Information Gain = 0.6229342045507877
P4HA1: Information Gain = 0.6182155449226839
GAPDHP1: Information Gain = 0.6146039495703683
BNIP3L: Information Gain = 0.6112144648963216
MT-CYB: Information Gain = 0.6071714290100367
MT-CO3: Information Gain = 0.6069431008363095
FAM162A: Information Gain = 0.5978657302685877
LDHAP4: Information Gain = 0.596249888004365
ENO2: Information Gain = 0.5901785407123785
HILPDA: Information Gain = 0.5893623976490796
ERO1A: Information Gain = 0.5885970357759547
PDK1: Information Gain = 0.5848792556958529
PGK1: Information Gain = 0.5829428094806763
VEGFA: Information Gain = 0.5788163036952783
C4orf3: Information Gain = 0.5759660862000839
LDHA: Information Gain = 0.5705938251485365
KDM3A: Information Gain = 0.567630555850209
DSP: Information Gain = 0.5673717635860669
PFKP: Information Gain = 0.5658814558901728
PFKFB3: Information Gain = 0.5621446524474103
DDIT4: Information Gain = 0.5587321183145155
PFKFB4: Information Gain = 0.5563513951856269
GAPDHP65: Information Gain = 0.5518422869034343
CYP1B1: Information Gain = 0.5478279572893794
GPI: Information Gain = 0.5463414129938879
MTATP6P1: Information Gain = 0.5433365961379271
CYP1B1-AS1: Information Gain = 0.5399181599619258
AK4: Information Gain = 0.5313380424537746
IRF2BP2: Information Gain = 0.5262810927773491
BNIP3P1: Information Gain = 0.5231721303178609
MT-ATP8: Information Gain = 0.5227929190054736
MXI1: Information Gain = 0.521986729924979
MT-ATP6: Information Gain = 0.5159273549148862
TLE1: Information Gain = 0.5121431131814433
FUT11: Information Gain = 0.5079246329268166
RIMKLA: Information Gain = 0.5075676127304747
UBC: Information Gain = 0.5017494409121412
IFITM2: Information Gain = 0.49174473971312427
CIART: Information Gain = 0.4838288131062345
TES: Information Gain = 0.48314077448680215
HK2P1: Information Gain = 0.48164542987256986
HIF1A-AS3: Information Gain = 0.48019147805755935
GBE1: Information Gain = 0.4682404929095607
MYO1B: Information Gain = 0.4671360917422964
GAPDH: Information Gain = 0.4652759935255599
P4HA2: Information Gain = 0.4612358739432907
SLC2A1: Information Gain = 0.45713553373714055
PGK1P1: Information Gain = 0.45594851318528784
ITGA5: Information Gain = 0.455490388642255
NFE2L2: Information Gain = 0.45355037214793414
ALDOA: Information Gain = 0.4534648457013917
RSBN1: Information Gain = 0.4478332979003772
MT-TK: Information Gain = 0.4427776960928569
EIF1: Information Gain = 0.43738209181343946
FDPS: Information Gain = 0.4364298819238388
STC2: Information Gain = 0.4354350129808717
DYNC2I2: Information Gain = 0.4306562276949879
MT-CO2: Information Gain = 0.4297002077703005
PGAM1: Information Gain = 0.4288020568529851
TMEM45A: Information Gain = 0.4280475241919677
ENO1: Information Gain = 0.4256631037747065
ALDOAP2: Information Gain = 0.4239391871018834
PTPRN: Information Gain = 0.4238279266457914
MIR210HG: Information Gain = 0.42375072209394915
RUSC1-AS1: Information Gain = 0.42107955859357005
FOSL2: Information Gain = 0.4209215935506061
C8orf58: Information Gain = 0.4201690137287297
PYCR3: Information Gain = 0.41982872951443606
ELOVL2: Information Gain = 0.4188648524431189
RAP2B: Information Gain = 0.4188411585428118
HLA-B: Information Gain = 0.4188350913738763
BHLHE40: Information Gain = 0.4186440514180141
RIOK3: Information Gain = 0.4181540271030093
BHLHE40-AS1: Information Gain = 0.41804350926163614
KRT80: Information Gain = 0.4165622772857305
SOX4: Information Gain = 0.4156009586455778
P4HA2-AS1: Information Gain = 0.4144323738931335
CYP1A1: Information Gain = 0.4132153269937726
USP3: Information Gain = 0.4121497462788888
SNRNP25: Information Gain = 0.41099315828642125
TNFRSF21: Information Gain = 0.41085701897253313
TANC2: Information Gain = 0.4101959471566188
PSME2: Information Gain = 0.40907793829418426
GAREM1: Information Gain = 0.40857925799250605
IER5L: Information Gain = 0.4069949408536011
AK1: Information Gain = 0.4050086126044392
WDR45B: Information Gain = 0.40402460462585
EGLN3: Information Gain = 0.4031746943669017
PGK1P2: Information Gain = 0.40306239617236583
EGLN1: Information Gain = 0.40203268782978707
GAPDHP72: Information Gain = 0.4008426894940911
PGP: Information Gain = 0.3988082639364958
CEBPG: Information Gain = 0.3980669873046683
SPOCK1: Information Gain = 0.39798202402055716
IFITM3: Information Gain = 0.397474754203881
DAPK3: Information Gain = 0.3973120603185849
GNA13: Information Gain = 0.39673965316054893
HLA-C: Information Gain = 0.39654383507888125
ACTG1: Information Gain = 0.3964875138159565
NAMPT: Information Gain = 0.39614301400784724
DSCAM-AS1: Information Gain = 0.39605419573237866
CLK3: Information Gain = 0.3954574889605338
SLC9A3R1: Information Gain = 0.39517935216581423
PNRC1: Information Gain = 0.39363140575457756
IGFBP3: Information Gain = 0.3931719534188891
SPRY1: Information Gain = 0.3925983191882785
MIR6892: Information Gain = 0.3923074540317111
NEBL: Information Gain = 0.3923034009119848
BBC3: Information Gain = 0.39161078593796916
PGM1: Information Gain = 0.3911060402431925
ADM: Information Gain = 0.39087034106792773
QSOX1: Information Gain = 0.3867698822775105
DARS1: Information Gain = 0.3857501826512435
MKNK2: Information Gain = 0.38513592976897404
SLC27A4: Information Gain = 0.38488776243527867
EML3: Information Gain = 0.3834618921920767
EMP2: Information Gain = 0.38236422581907115
SDF2L1: Information Gain = 0.38158946030759044
ST3GAL1: Information Gain = 0.3807268894099909
TGIF1: Information Gain = 0.37842615624904785
GAPDHP70: Information Gain = 0.37807564586542686
MRPL4: Information Gain = 0.37775753623058184
DAAM1: Information Gain = 0.37772388209605534
LY6E: Information Gain = 0.37718353502666613
IDI1: Information Gain = 0.3764664715381756
TST: Information Gain = 0.37402423295649734
SLC9A3R1-AS1: Information Gain = 0.3733853335277135
IFITM1: Information Gain = 0.373342356138197
HNRNPA2B1: Information Gain = 0.37275788251115194
CCNG2: Information Gain = 0.3726036381922173
TRAPPC4: Information Gain = 0.37222042467406746
VLDLR-AS1: Information Gain = 0.37166728309810226
GAPDHP60: Information Gain = 0.3715842521944981
LSM4: Information Gain = 0.36904358922180713
NCK2: Information Gain = 0.36878149996260157
ARPC1B: Information Gain = 0.36826773787403444
GABARAP: Information Gain = 0.36795610837167536
LDHAP7: Information Gain = 0.36792313803617516
TSC22D2: Information Gain = 0.36789543271156844
PRELID2: Information Gain = 0.36715733118201443
MSANTD3: Information Gain = 0.3671229762959629
RAD9A: Information Gain = 0.36659651527949566
POLR1D: Information Gain = 0.3662021528445567
MIR3615: Information Gain = 0.3661494301488195
CA9: Information Gain = 0.365619373506876
PSME2P2: Information Gain = 0.36559447938884904
MKRN1: Information Gain = 0.36508193097313857
CTPS1: Information Gain = 0.36402593760931246
NTN4: Information Gain = 0.363195548010562
NDUFS8: Information Gain = 0.3628959134016638
LDHAP2: Information Gain = 0.3625219110136326
NDUFB8: Information Gain = 0.362078184156331
ZNF292: Information Gain = 0.36197735510717166
SRM: Information Gain = 0.36187523886220085
BTG1: Information Gain = 0.36170170174064564
OSER1: Information Gain = 0.36161506766626816
ELF3: Information Gain = 0.36096679463213244
CTNNA1: Information Gain = 0.3608324057666934
RNF183: Information Gain = 0.3604693097536713
DHRS3: Information Gain = 0.3603187937028458
MIR7703: Information Gain = 0.36012296500030905
KCMF1: Information Gain = 0.3601103841923736
FTL: Information Gain = 0.3595725131269818
C2orf72: Information Gain = 0.35904665335610564
DDIT3: Information Gain = 0.3586628397863165
STK38L: Information Gain = 0.35813983280888495
SMAD2: Information Gain = 0.35749187421427986
EGILA: Information Gain = 0.35660910740014384
SMAD9: Information Gain = 0.35656762712923107
IL27RA: Information Gain = 0.35654257378000187
FAM110C: Information Gain = 0.35629791054717197
RBPJ: Information Gain = 0.3558217952029479
ESYT2: Information Gain = 0.35545064857981057
TUBD1: Information Gain = 0.35493934989054643
ZNF160: Information Gain = 0.3546031869928732
PKM: Information Gain = 0.35424225765755923
TGFBI: Information Gain = 0.35404245709042814
TMSB10: Information Gain = 0.35389879248481226
MACC1: Information Gain = 0.3529591099199798
PAM: Information Gain = 0.3527214546612061
IGDCC3: Information Gain = 0.35267208585575793
ZYX: Information Gain = 0.3511749011782481
HMOX1: Information Gain = 0.35104895398183356
HELLS: Information Gain = 0.3509937256917899
SFXN2: Information Gain = 0.35056741535459324
FNIP1: Information Gain = 0.3493812174923647
GAPDHP61: Information Gain = 0.348093168337664
TPD52: Information Gain = 0.34801790899781015
CRELD2: Information Gain = 0.3477293376504813
TXNRD1: Information Gain = 0.34757296411238014
RORA: Information Gain = 0.3464872207638918
WASF2: Information Gain = 0.3463019058808874
RAMP1: Information Gain = 0.34619204392200165
RND3: Information Gain = 0.3460187732111957
ZNF395: Information Gain = 0.3460147086576546
FYN: Information Gain = 0.3456459956615583
GAPDHP63: Information Gain = 0.3450614953671438
UHRF1: Information Gain = 0.3450421573420035
TUBG1: Information Gain = 0.3444244037995772
EIF4A2: Information Gain = 0.34418989125064314
KLF3: Information Gain = 0.3439628991066539
RHOD: Information Gain = 0.34392335356920567
DAPP1: Information Gain = 0.3438721363888555
AVL9: Information Gain = 0.3433941577119186
SLC3A2: Information Gain = 0.3432587678042627
TFG: Information Gain = 0.34274340074686105
TCAF2P1: Information Gain = 0.34273958916807645
RCAN3: Information Gain = 0.34270173540916193
PPP1CA: Information Gain = 0.3426022649570626
MIR5047: Information Gain = 0.34205982873118934
LRR1: Information Gain = 0.3416935755454038
YEATS2: Information Gain = 0.34168645726383096
MYL12A: Information Gain = 0.34157029777984205
BEST1: Information Gain = 0.3415170720983203
CLDND1: Information Gain = 0.34142110741240894
NUPR1: Information Gain = 0.3412992639358141
ARFGEF3: Information Gain = 0.3411195798043589
FTH1: Information Gain = 0.3407931761362468
HMBS: Information Gain = 0.340591026039869
DUSP10: Information Gain = 0.3393194040131773
ALOX5AP: Information Gain = 0.3390106069880774
VLDLR: Information Gain = 0.3389803961087652
SINHCAF: Information Gain = 0.3382079122985737
RPL17P50: Information Gain = 0.3381513455473184
RNF19B: Information Gain = 0.33780075663637854
ZFAS1: Information Gain = 0.3374762237584428
FASN: Information Gain = 0.3369478110769608
PGM2L1: Information Gain = 0.33654859510549184
RRAGD: Information Gain = 0.3365130732238193
MYRIP: Information Gain = 0.33627803887644836
GGCT: Information Gain = 0.3360943731512249
KLF3-AS1: Information Gain = 0.3359169414860945
DCXR: Information Gain = 0.3350225570572125
TLE1P1: Information Gain = 0.3347659397844036
CDC42EP1: Information Gain = 0.3347134397284013
RPL34: Information Gain = 0.33450125917677065
PCAT6: Information Gain = 0.33443298651000264
EBP: Information Gain = 0.3341287613252497
DUSP4: Information Gain = 0.3338982090514582
CHD2: Information Gain = 0.33343118788088444
ANGPTL4: Information Gain = 0.33175154774551063
RUNX1: Information Gain = 0.331369692996182
INSIG2: Information Gain = 0.33104854544093665
PHLDA3: Information Gain = 0.33104317400127603
GAPDHP40: Information Gain = 0.33101985240964926
RANBP1: Information Gain = 0.33090936165847284
POLR2L: Information Gain = 0.3308256427945433
RNASE4: Information Gain = 0.33079346977729474
DNPH1: Information Gain = 0.33067321033645514
HPDL: Information Gain = 0.330410292103855
POP5: Information Gain = 0.3296129274701629
ATP5F1D: Information Gain = 0.3291292645142707
THAP8: Information Gain = 0.3284083698130156
WEE1: Information Gain = 0.3283100110995478
CCNI: Information Gain = 0.3282750713628402
SLC29A1: Information Gain = 0.32796683907426116
TRIB3: Information Gain = 0.32683887210551266
KLF7: Information Gain = 0.3268368450397412
FOXO3: Information Gain = 0.326681880222085
PSME2P1: Information Gain = 0.3262130833146015
GNAS-AS1: Information Gain = 0.32564097711994533
FAM220A: Information Gain = 0.3254310982002362
ZNF12: Information Gain = 0.3253328113026219
NUDT5: Information Gain = 0.32505185344772247
MFSD3: Information Gain = 0.3249535962345529
ANG: Information Gain = 0.3248630408018913
DOK7: Information Gain = 0.3245204183557826
PRMT6: Information Gain = 0.32450453828930503
FBXL6: Information Gain = 0.3238838064446756
ELOVL6: Information Gain = 0.3237582352083239
VDAC1: Information Gain = 0.3236953619518801
STRA6: Information Gain = 0.3234145245361739
ASNSP1: Information Gain = 0.32335962986940836
HNRNPAB: Information Gain = 0.32332925710335103
CAPN2: Information Gain = 0.3224888605805294
SLITRK6: Information Gain = 0.32244457539653526
GRB10: Information Gain = 0.32242624636886474
FEN1: Information Gain = 0.32136379373198953
FBXO42: Information Gain = 0.32079991093889904
SLC25A36: Information Gain = 0.320694633789792
CDC42EP3: Information Gain = 0.32067954277740784
GET1: Information Gain = 0.32013105592496616
PCBP1-AS1: Information Gain = 0.32010926048400723
FOXO1: Information Gain = 0.3199641336740555
HEY1: Information Gain = 0.31994812160074027
FAM13A: Information Gain = 0.31990516138402025
BCL10: Information Gain = 0.3198850248481111
FBXO16: Information Gain = 0.3197602617558919
PDZK1: Information Gain = 0.31970430083513146
PTGER4: Information Gain = 0.31960954182225954
TFRC: Information Gain = 0.3195625112278613
KDM5B: Information Gain = 0.3188760416491607
GINS2: Information Gain = 0.3183553236150536
VPS37D: Information Gain = 0.31799650745238695
ADCY9: Information Gain = 0.31767419372839734
LRATD2: Information Gain = 0.3171161426448659
NDUFC2: Information Gain = 0.3167477729169792
NECAB1: Information Gain = 0.3161118054466716
TKFC: Information Gain = 0.3158320063818574
TRIM16: Information Gain = 0.3157956379198734
CDC45: Information Gain = 0.31530942073784374
LINC02649: Information Gain = 0.3152550893733177
TMEM265: Information Gain = 0.31503321958122377
EDN2: Information Gain = 0.31454691204912977
DENND11: Information Gain = 0.3143727433403063
SRF: Information Gain = 0.31422955560186216
GPS1: Information Gain = 0.3141412830672585
FAM13A-AS1: Information Gain = 0.31299800655638
PDLIM5: Information Gain = 0.3129442000325542
KLHL2P1: Information Gain = 0.31264732761354486
ATP5MC1: Information Gain = 0.3125339901952562
ZBTB21: Information Gain = 0.312330833013021
CFD: Information Gain = 0.3120811544099096
EMX1: Information Gain = 0.31189765444211415
PLBD1: Information Gain = 0.3118961097736972
PTPRH: Information Gain = 0.31171136273245437
ATP5F1E: Information Gain = 0.3114680363353515
APEH: Information Gain = 0.3111802559311103
TCAF2: Information Gain = 0.3110020449343913
MAP1B: Information Gain = 0.31090027335479387
TMEM64: Information Gain = 0.3106699917456266
NECTIN2: Information Gain = 0.3106558505166437
NDUFS6: Information Gain = 0.3105691445312495
TMEM123: Information Gain = 0.3105040936939487
CERS4: Information Gain = 0.3099918617220536
LDHAP3: Information Gain = 0.3098170964458886
CD55: Information Gain = 0.30978055710542063
EIF4EBP1: Information Gain = 0.3097658298422141
PAGR1: Information Gain = 0.3097245388671901
ADAMTS19-AS1: Information Gain = 0.30965733820028163
SEC31A: Information Gain = 0.3095088742066143
FADS1: Information Gain = 0.3094381943409463
GPNMB: Information Gain = 0.30917942831103984
MSANTD3-TMEFF1: Information Gain = 0.30910206582142385
CHMP4C: Information Gain = 0.30900883685448766
TMEM65: Information Gain = 0.3084778644514743
IMMP2L: Information Gain = 0.30846963605807076
RLF: Information Gain = 0.3082882382988481
GAD1: Information Gain = 0.30817765657000007
SDAD1P1: Information Gain = 0.3080022192929237
ANKRD12: Information Gain = 0.307882980052651
SNX27: Information Gain = 0.3075631715286389
RPL21: Information Gain = 0.30736080975371194
ASF1B: Information Gain = 0.30702111713878777
C1QBP: Information Gain = 0.3068227461494053
DHCR7: Information Gain = 0.30676716592698194
FADS2: Information Gain = 0.30660423965720307
ACLY: Information Gain = 0.306483535366455
CENATAC-DT: Information Gain = 0.3064441246016518
FTH1P16: Information Gain = 0.30600373891133614
H2AX: Information Gain = 0.3057210705939095
VEGFC: Information Gain = 0.3053664819969162
LOXL2: Information Gain = 0.304988153436617
MYO1E: Information Gain = 0.30366960851032876
CCDC28B: Information Gain = 0.3036646783596435
TUFT1: Information Gain = 0.30349431485227374
GAPDHP21: Information Gain = 0.30334089470016967
MOV10: Information Gain = 0.3031108913020306
BCL2: Information Gain = 0.30305833863962883
FLRT3: Information Gain = 0.302677769705896
CBLB: Information Gain = 0.3025725216820696
TRABD2A: Information Gain = 0.30223625603901283
MYO10: Information Gain = 0.302054278174376
MPV17L2: Information Gain = 0.3019493279604124
NDUFB1: Information Gain = 0.30187377585919495
WSB1: Information Gain = 0.3017518948438802
TEDC2: Information Gain = 0.3014144663041751
SDR16C5: Information Gain = 0.3013740670883802
OLFM1: Information Gain = 0.3012045072677103
KLF6: Information Gain = 0.3011681139644382
KPNA2: Information Gain = 0.30109539921754647
CEACAM5: Information Gain = 0.30089604482391996
PHTF1: Information Gain = 0.30080665810492313
ZNF84: Information Gain = 0.3007187828174662
SYT12: Information Gain = 0.3005385073544824
DHRS11: Information Gain = 0.30049312546269746
FDFT1: Information Gain = 0.29992512536029836
MYCBP: Information Gain = 0.2998768601470061
AZIN1: Information Gain = 0.2996308056460979
MYH9: Information Gain = 0.29931831760691674
ACOT7: Information Gain = 0.2992619104811354
DBI: Information Gain = 0.2986955430728764
TTC9: Information Gain = 0.2986518443237698
PPP1R10: Information Gain = 0.2983614160608137
MMP16: Information Gain = 0.29813438959166483
SLC25A10: Information Gain = 0.29807238362168365
SH3GL3: Information Gain = 0.29801330735278353
PSAP: Information Gain = 0.297848973075175
DMRTA1: Information Gain = 0.2976892950213874
ATXN1-AS1: Information Gain = 0.29754998990579473
UNC5B-AS1: Information Gain = 0.29747883171939304
LIMCH1: Information Gain = 0.2973255628283493
FANCG: Information Gain = 0.2972506588534811
AGPS: Information Gain = 0.2966435490167285
BCAS1: Information Gain = 0.29613102875794506
DGKD: Information Gain = 0.29604047815969725
ARL8A: Information Gain = 0.29603795929366883
KCNK5: Information Gain = 0.2959695124190611
PCAT1: Information Gain = 0.2954666553942149
MEIKIN: Information Gain = 0.2953160901274361
TPT1-AS1: Information Gain = 0.2952186439593394
CDK2AP1: Information Gain = 0.29472043161181305
ATXN1: Information Gain = 0.2946223913205934
GPR179: Information Gain = 0.2946153106824134
IFFO2: Information Gain = 0.2944336505932923
KLF11: Information Gain = 0.29420594988367155
ACAT2: Information Gain = 0.2940526832618515
PCP4L1: Information Gain = 0.2939103511675727
GPR146: Information Gain = 0.2938787975029069
MB: Information Gain = 0.2934957539438581
BEND5: Information Gain = 0.29334796276005104
BCL2L12: Information Gain = 0.29302253622902086
COPS9: Information Gain = 0.29258449329476344
DOLK: Information Gain = 0.2924016234559621
PCBP1: Information Gain = 0.29227980584110536
ELOVL5: Information Gain = 0.2922482511323048
SHISA5: Information Gain = 0.2918299340926278
PLOD2: Information Gain = 0.29174907058699673
CSNK1A1: Information Gain = 0.29172910169603483
RNF149: Information Gain = 0.2914991774376765
ATAD3A: Information Gain = 0.2910772157484438
ATF4: Information Gain = 0.2908437416511007
RPL31: Information Gain = 0.2907535224799449
PALLD: Information Gain = 0.2906072798392487
PLOD1: Information Gain = 0.2904364027222146
C1orf116: Information Gain = 0.2902859123379993
ADGRF4: Information Gain = 0.29020803697542075
HLA-W: Information Gain = 0.2901208249649754
GYS1: Information Gain = 0.2900655751474077
TMOD3: Information Gain = 0.2900337462110738
KCNG1: Information Gain = 0.29001031531549515
TPX2: Information Gain = 0.2900045871740078
PTEN: Information Gain = 0.2897363317419843
TAF9B: Information Gain = 0.28963165574572614
BOD1: Information Gain = 0.28943040336402515
EDA2R: Information Gain = 0.28873017518360733
CHRNA5: Information Gain = 0.2885129592685143
HSD17B10: Information Gain = 0.28841393482416455
MALL: Information Gain = 0.28830520230122025
HAUS8: Information Gain = 0.2879304072014175
GADD45A: Information Gain = 0.28788798520622416
B4GAT1: Information Gain = 0.28786909576387854
ARF6: Information Gain = 0.28785399181091953
ZFAND1: Information Gain = 0.28775967143399717
RAB6A: Information Gain = 0.2874887082939297
USP3-AS1: Information Gain = 0.2872057503382648
ELL2: Information Gain = 0.28713750463463317
RET: Information Gain = 0.286085243371917
ATF2: Information Gain = 0.28584917628151674
WDR45BP1: Information Gain = 0.28578903381019516
SIKE1: Information Gain = 0.28575175737510894
KRTAP5-2: Information Gain = 0.2855551688135123
PLIN5: Information Gain = 0.28545308035331174
GAS5: Information Gain = 0.2853596268712164
LRIG3: Information Gain = 0.28530859631222105
NRP1: Information Gain = 0.28529596134736823
GFRA1: Information Gain = 0.2850629302268133
CHAC2: Information Gain = 0.28486026578520596
ATXN3: Information Gain = 0.28455237222775964
TMEM104: Information Gain = 0.2845144601336218
ANKZF1: Information Gain = 0.28439414368713023
ULBP1: Information Gain = 0.2842713581229337
MICB: Information Gain = 0.2840357676990397
IFI35: Information Gain = 0.28382159796524187
HLA-E: Information Gain = 0.28370758459277146
PIK3R3: Information Gain = 0.2836388308339399
NFIL3: Information Gain = 0.283594120218551
PHF19: Information Gain = 0.2834300215909078
CLVS1: Information Gain = 0.28338661697840806
ATP1B1: Information Gain = 0.2831633454056546
CDC25A: Information Gain = 0.2830094930335638
IDI2-AS1: Information Gain = 0.28291140225447675
NDUFC2-KCTD14: Information Gain = 0.2828311697468151
KLHL24: Information Gain = 0.28233429824537315
FBXO32: Information Gain = 0.28231561566636043
TMEM229B: Information Gain = 0.282254847485333
TSPAN4: Information Gain = 0.28217420948011696
FCGRT: Information Gain = 0.28169831411937607
RAP1GAP: Information Gain = 0.2816418145251669
FAM167A: Information Gain = 0.2814093450832613
ENDOG: Information Gain = 0.28133759976424866
TMEM59: Information Gain = 0.2813076689087106
MVK: Information Gain = 0.2812700775642154
GAPDHP71: Information Gain = 0.2810268560807929
POLR3K: Information Gain = 0.2808673136747568
S100A13: Information Gain = 0.2808620932887578
FBXO38: Information Gain = 0.28079868060468494
LDLRAD1: Information Gain = 0.28026312962833044
MT-CO1: Information Gain = 0.2801851646398599
LAMC2: Information Gain = 0.28007836777009776
PPFIA4: Information Gain = 0.279971944903495
ANXA1: Information Gain = 0.27994197387265585
GDF15: Information Gain = 0.27993567211998105
IL3RA: Information Gain = 0.2797414069482893
GPAT3: Information Gain = 0.2797391504208724
SPC24: Information Gain = 0.2796221490503945
UBE2QL1: Information Gain = 0.27960256320645716
MIR6728: Information Gain = 0.27951575119843386
MALAT1: Information Gain = 0.27927849493832535
PLAAT2: Information Gain = 0.27922824268297086
ACTG1P10: Information Gain = 0.2790785764110386
MYL12-AS1: Information Gain = 0.27877409878308135
GOLM1: Information Gain = 0.2783071510387851
MIR1199: Information Gain = 0.2782619457947797
EIF4B: Information Gain = 0.2782392269290086
CYB561A3: Information Gain = 0.2778732456279105
PPM1K-DT: Information Gain = 0.27781180492837865
MRPL28: Information Gain = 0.277723926449845
CDCA7: Information Gain = 0.2776667445604355
CCDC74A: Information Gain = 0.2775895512791404
SLC25A39: Information Gain = 0.27758744922216283
C4orf47: Information Gain = 0.2774811660883547
ABHD15: Information Gain = 0.27746572127972846
ADM2: Information Gain = 0.277462548589982
PYGL: Information Gain = 0.2773830632561749
FRY: Information Gain = 0.27712361152247333
FUOM: Information Gain = 0.27697501464226093
FTLP3: Information Gain = 0.2769067871979518
GPER1: Information Gain = 0.2767797633696436
ZNF689: Information Gain = 0.27673180646210715
GALNT18: Information Gain = 0.2765688189486142
RPS27: Information Gain = 0.2765550083052586
MIR181A1HG: Information Gain = 0.27653945540651814
POLA2: Information Gain = 0.27645649318136
SCEL: Information Gain = 0.27644508610738927
FAM47E-STBD1: Information Gain = 0.2761198426022975
INSYN1-AS1: Information Gain = 0.2760825503175002
SAT1: Information Gain = 0.27601807909488985
FOXP1: Information Gain = 0.2759846960610546
SLC25A35: Information Gain = 0.2759237799206835
HLA-T: Information Gain = 0.2758338164292644
C6orf141: Information Gain = 0.27571457581725656
SERGEF: Information Gain = 0.27510515710005135
TRIM29: Information Gain = 0.27497537034386466
HAUS1: Information Gain = 0.2748742430144069
SPRR1A: Information Gain = 0.2747193031457549
APOBEC3A: Information Gain = 0.27465527841288706
SNTB1: Information Gain = 0.27435341508762967
RNF19A: Information Gain = 0.2742626712785603
YEATS2-AS1: Information Gain = 0.27424764417351577
ATIC: Information Gain = 0.27407261350536216
TMEM54: Information Gain = 0.2739833270241119
CENPM: Information Gain = 0.2737275064007094
P3R3URF-PIK3R3: Information Gain = 0.2736094211579705
GPR155: Information Gain = 0.2733991791469623
RYR2: Information Gain = 0.27333990766439253
SERINC3: Information Gain = 0.27333439570270923
CD9: Information Gain = 0.2733339438288158
CCN4: Information Gain = 0.2732647792037217
MAOB: Information Gain = 0.2730994873583439
RPL7: Information Gain = 0.27309181216141876
TNFRSF19: Information Gain = 0.2730729341908622
LDHAP5: Information Gain = 0.27281341231407175
LRP4: Information Gain = 0.27276394495011314
LPP: Information Gain = 0.2726877939672576
LNPK: Information Gain = 0.2725022984947152
NDUFA4L2: Information Gain = 0.2724591727373227
CAST: Information Gain = 0.2722637510089314
CISD3: Information Gain = 0.27222659330117605
CCSAP: Information Gain = 0.27207630879822164
NAPRT: Information Gain = 0.27192074583119896
METTL7A: Information Gain = 0.27186482517859534
CPEB2: Information Gain = 0.27149418474970255
WDR4: Information Gain = 0.27130327620672334
FTH1P20: Information Gain = 0.27122268636989655
TBC1D8B: Information Gain = 0.2710178806369812
SCARB1: Information Gain = 0.2710060283791378
FAM210A: Information Gain = 0.27100056494115865
PLD1: Information Gain = 0.27098811331982486
CDK5R2: Information Gain = 0.2709189323798129
MTHFD1: Information Gain = 0.2708245218703944
XPOT: Information Gain = 0.27082378271435736
PPP1R3C: Information Gain = 0.27070757068813345
MCM3: Information Gain = 0.2706744684031319
RPL23AP7: Information Gain = 0.2706666920899228
PPP1R14C: Information Gain = 0.270628793362899
TPD52L1: Information Gain = 0.2706237500847388
UNC5B: Information Gain = 0.2705994769100353
FUT3: Information Gain = 0.27058340696118544
JPH2: Information Gain = 0.2705722712696348
SAMD4A: Information Gain = 0.2704497995131947
IGFLR1: Information Gain = 0.27028743243626696
MUC16: Information Gain = 0.27019023099775286
HLA-L: Information Gain = 0.27011851151874344
MRNIP: Information Gain = 0.2699678326861574
ZNF365: Information Gain = 0.26989961038404453
RCN1P2: Information Gain = 0.26986592799502596
RAPGEFL1: Information Gain = 0.26970465917339914
ADAT1: Information Gain = 0.269668358453655
HINT3: Information Gain = 0.26962279661774535
SLC7A11: Information Gain = 0.26954837406896903
RIBC2: Information Gain = 0.2695041333367101
SAMHD1: Information Gain = 0.2694686015826606
GAL: Information Gain = 0.26889526106001127
CXADR: Information Gain = 0.26883249667961273
HSD17B1-AS1: Information Gain = 0.26865426447537266
SMAP1: Information Gain = 0.2685879042865864
ELOVL2-AS1: Information Gain = 0.2685606160497007
LOX: Information Gain = 0.2685120639738079
SHMT1: Information Gain = 0.26847669675550456
KRT83: Information Gain = 0.2684719289962618
NUP62CL: Information Gain = 0.2683971804611138
SPATS2L: Information Gain = 0.2683304311391774
RECQL4: Information Gain = 0.2682228336783341
TKT: Information Gain = 0.2682165828581684
PWWP3B: Information Gain = 0.2680711004412595
INSYN1: Information Gain = 0.26801295082488785
A4GALT: Information Gain = 0.2679370460447652
STING1: Information Gain = 0.26791495656248876
KRTAP5-AS1: Information Gain = 0.2679015189557685
SRPX: Information Gain = 0.26788444019989144
TBC1D3L: Information Gain = 0.26785704007027844
AGMAT: Information Gain = 0.2674493404704301
FRK: Information Gain = 0.26718387439791424
LATS1: Information Gain = 0.2671765825654524
KRT224P: Information Gain = 0.2670133773593706
GRM4: Information Gain = 0.2669191879132642
HOXA10: Information Gain = 0.26677228575890566
PDGFB: Information Gain = 0.2665151606225167
EIF2B3: Information Gain = 0.266485374868352
PACSIN2: Information Gain = 0.26638352435583723
PPM1J: Information Gain = 0.26636598066985906
ST8SIA6-AS1: Information Gain = 0.2661689850786495
RNPEP: Information Gain = 0.266132046843891
CBX5: Information Gain = 0.26611636925031856
PNMA2: Information Gain = 0.26587747696327435
ANXA2R: Information Gain = 0.26586361569191364
PAK6: Information Gain = 0.2657868438263389
GAPDHP73: Information Gain = 0.2657797301792528
EGFR: Information Gain = 0.26571650912922107
FAM111B: Information Gain = 0.26563558960574274
CDKN2AIPNL: Information Gain = 0.26559940324945663
SOGA3: Information Gain = 0.2655402568457026
MCM10: Information Gain = 0.26547568922346954
CD109: Information Gain = 0.2654202016700278
CDC20: Information Gain = 0.26516439256061664
AHR: Information Gain = 0.26511512636575296
HOXA13: Information Gain = 0.26473274528467106
KMT5B: Information Gain = 0.26458030693325285
GAPDHP64: Information Gain = 0.26456750040151666
C15orf65: Information Gain = 0.2645401070197868
FAM214B: Information Gain = 0.2643437475577086
SLC25A15: Information Gain = 0.26428543137815175
S100P: Information Gain = 0.2642798546794567
GAPDHP69: Information Gain = 0.26399190888642043
RIPPLY3: Information Gain = 0.2639675797430987
RAB3IL1: Information Gain = 0.2639308193270762
ALDOAP1: Information Gain = 0.263837402464538
MCRIP2P1: Information Gain = 0.26374732314647864
SLC26A5: Information Gain = 0.2636596468327064
SQSTM1: Information Gain = 0.2634830552261458
TCP11L2: Information Gain = 0.2634748490729
NDUFB10: Information Gain = 0.26347133331276207
POMGNT1: Information Gain = 0.26342751221199956
WDR76: Information Gain = 0.2633516269245191
CHTF8: Information Gain = 0.26334661348193045
OTULINL: Information Gain = 0.2630369569385196
LRATD1: Information Gain = 0.26301356216611516
WDR61: Information Gain = 0.2629718771551306
TTC36: Information Gain = 0.26294214201273536
DPF1: Information Gain = 0.2629234491848389
CFDP1: Information Gain = 0.26291322873594924
ETNK2: Information Gain = 0.26283961517666143
MIR7844: Information Gain = 0.26283889242301717
PARP1: Information Gain = 0.26276813247299
ADGRF1: Information Gain = 0.2626112477871494
IRF6: Information Gain = 0.2625689304528829
LINC00623: Information Gain = 0.2625334334803884
MTCO3P12: Information Gain = 0.2624894537535942
GAPDHP35: Information Gain = 0.2624746781422316
MFSD13A: Information Gain = 0.26246431321333885
ARMC6: Information Gain = 0.26245625092006586
GET1-SH3BGR: Information Gain = 0.2623876112128516
CD320: Information Gain = 0.26231140958174004
MTHFD2: Information Gain = 0.2622930019403662
VAPA: Information Gain = 0.26224578497575846
MIF: Information Gain = 0.2622259267751499
ZNF367: Information Gain = 0.2622067136252557
ZNF148: Information Gain = 0.26217417169400625
SEMA4B: Information Gain = 0.26198125128610794
NECTIN3-AS1: Information Gain = 0.2619539111813203
PCCA-DT: Information Gain = 0.2619133352160974
KCND3: Information Gain = 0.26172317985122007
CAVIN1: Information Gain = 0.261715295201298
ATP5F1A: Information Gain = 0.26169628794437116
PCLAF: Information Gain = 0.26161069246146873
DAPK2: Information Gain = 0.26153577440333664
SLC1A1: Information Gain = 0.26151576022504397
DCAF10: Information Gain = 0.261363261227344
E2F2: Information Gain = 0.26135226966343406
GAS5-AS1: Information Gain = 0.2612816665776385
PPP1R14B-AS1: Information Gain = 0.26117530733960614
XPOTP1: Information Gain = 0.2611103539615405
H3C4: Information Gain = 0.26106165167219775
MRPL38: Information Gain = 0.2610600135754435
GOLGA6L10: Information Gain = 0.26092294354913226
NRGN: Information Gain = 0.2608882588856587
DTL: Information Gain = 0.26086206442982296
HSD17B1: Information Gain = 0.2607906410567955
RGCC: Information Gain = 0.260776672183884
AIFM1: Information Gain = 0.2607430726686919
SNHG22: Information Gain = 0.2605938256697029
MRPL41: Information Gain = 0.26058771359074084
NT5DC2: Information Gain = 0.26055371616864154
CYP4F22: Information Gain = 0.26048301521679074
BEST4: Information Gain = 0.2604471166715654
NKAIN1: Information Gain = 0.2603064311287244
POLD1: Information Gain = 0.2602431687305602
TUBA3E: Information Gain = 0.26016613557879986
KLF13: Information Gain = 0.2601652403395218
LINC01214: Information Gain = 0.2601495462207397
GIHCG: Information Gain = 0.260081036307515
STXBP5-AS1: Information Gain = 0.26006180211843066
CDKN3: Information Gain = 0.2599372481496778
TARS1: Information Gain = 0.2598808850300307
APOL4: Information Gain = 0.2598368372636737
H4C5: Information Gain = 0.25968214666720435
ZNF337: Information Gain = 0.2596069261190588
DHCR24: Information Gain = 0.25954409417646174
PPP2R5B: Information Gain = 0.25953154521978505
PARK7: Information Gain = 0.2594820394582502
CLPSL2: Information Gain = 0.25944520313054675
RTN4RL1: Information Gain = 0.25940848301644737
RNF144A: Information Gain = 0.2593789825581869
FAM86C1P: Information Gain = 0.2593615875143245
AKR1C1: Information Gain = 0.25934155932255964
H2AC7: Information Gain = 0.25931391711131346
EDN1: Information Gain = 0.25922439577914
CBX4: Information Gain = 0.2592018393409252
MIF-AS1: Information Gain = 0.2591724213226312
MAP4K2: Information Gain = 0.2589883354112861
COA8: Information Gain = 0.2589623261795291
IFI30: Information Gain = 0.2589016054554596
BRCA1: Information Gain = 0.2588803441986389
GON7: Information Gain = 0.25876867986184027
RBBP7: Information Gain = 0.25872781739517414
SORL1: Information Gain = 0.25869609049530995
BSCL2: Information Gain = 0.2585453814872547
KRT4: Information Gain = 0.2585029032664219
FGF2: Information Gain = 0.25846692118812165
CDK5: Information Gain = 0.25842636078140746
DMC1: Information Gain = 0.25838924365510496
TUBA4A: Information Gain = 0.2583630279624467
FKBP5: Information Gain = 0.2583341647908879
CCDC107: Information Gain = 0.2582591217234478
H2AC9P: Information Gain = 0.25824760974434735
TMEM74B: Information Gain = 0.2581296005405471
NPC1L1: Information Gain = 0.2580660338711631
NDUFA4: Information Gain = 0.2579507081468715
DRAXIN: Information Gain = 0.25792846995014385
TMEM19: Information Gain = 0.2578648538242809
BMF: Information Gain = 0.2578379382620135
PLEKHG1: Information Gain = 0.2578312249077477
RNF180: Information Gain = 0.2577980879906805
HYMAI: Information Gain = 0.2575080693545433
IFI44: Information Gain = 0.2574102060566896
ARID5A: Information Gain = 0.2573344610623354
PLK1: Information Gain = 0.25732496481268874
CEACAM6: Information Gain = 0.25732000000867594
DNASE1L2: Information Gain = 0.25730241387634134
EEF1A1: Information Gain = 0.25726466271272974
TPSP2: Information Gain = 0.25715090030266197
STBD1: Information Gain = 0.25711061264468515
ZNF528-AS1: Information Gain = 0.25707207791649944
CYRIA: Information Gain = 0.25689805828412915
ENO1P1: Information Gain = 0.2568551819194749
ITGB3BP: Information Gain = 0.25682054236428886
HDHD5-AS1: Information Gain = 0.25672813892366886
TNFRSF18: Information Gain = 0.2566193394329608
SPATA18: Information Gain = 0.25653833152148975
TLCD1: Information Gain = 0.2564550249350461
SNTA1: Information Gain = 0.25642733569281684
MED15: Information Gain = 0.25636897402013537
ZNF682: Information Gain = 0.25606513414719245
AZIN2: Information Gain = 0.2560584761868394
HEATR6: Information Gain = 0.256033918539905
ENOX1: Information Gain = 0.25595366865609215
RNU1-82P: Information Gain = 0.255897863786942
ADRA2A: Information Gain = 0.25585309228671704
CCDC33: Information Gain = 0.25571639445211614
AMPD3: Information Gain = 0.25566919660306775
TNFRSF6B: Information Gain = 0.25559930291289046
HIGD1AP1: Information Gain = 0.2553839469424519
PLEKHO1: Information Gain = 0.2553101998890821
TLE6: Information Gain = 0.255220096358447
ACTBP15: Information Gain = 0.25520234667051245
MITF: Information Gain = 0.25515253987196607
PKDCC: Information Gain = 0.2549932563848616
ARFRP1: Information Gain = 0.25492483093829654
FTH1P12: Information Gain = 0.2549213619167505
MIR210: Information Gain = 0.2549022530376841
MEF2A: Information Gain = 0.25489820765025173
REEP2: Information Gain = 0.2548832615994545
OTX1: Information Gain = 0.2548085094936017
VXN: Information Gain = 0.25475498285944753
SLK: Information Gain = 0.25471170883422967
PARM1: Information Gain = 0.2545795133591311
TSPAN12: Information Gain = 0.2545592977517994
NIBAN1: Information Gain = 0.25451738955894987
TOX2: Information Gain = 0.25447901381712246
CFAP418-AS1: Information Gain = 0.2544155786229587
MYBL1: Information Gain = 0.25430824556168274
MIR34AHG: Information Gain = 0.254289640234179
SINHCAFP1: Information Gain = 0.25422625328191084
GLUD1P3: Information Gain = 0.25420163536573126
FTH1P15: Information Gain = 0.25399151303071177
ANAPC5: Information Gain = 0.25394233033138325
G6PC3: Information Gain = 0.2538309888632868
CASTOR3: Information Gain = 0.25382795178544804
BTG1-DT: Information Gain = 0.253803602429747
TPM4: Information Gain = 0.25360400360363755
CYFIP2: Information Gain = 0.2535144418196691
DPAGT1: Information Gain = 0.25351128208898555
GATA2: Information Gain = 0.25348625612874787
ASNS: Information Gain = 0.25335502057379466
SEL1L: Information Gain = 0.25315175188112726
RUSC1: Information Gain = 0.2531214197689864
RN7SL674P: Information Gain = 0.25308008285098693
RCN3: Information Gain = 0.25298384077634917
CALM3: Information Gain = 0.252969408466841
ABHD8: Information Gain = 0.2529425493863364
LPIN3: Information Gain = 0.252869690728168
ZMPSTE24-DT: Information Gain = 0.2528365753723889
DNAAF10: Information Gain = 0.25279370495677855
SNW1: Information Gain = 0.25279106755565883
S100A4: Information Gain = 0.25272116126319766
LSS: Information Gain = 0.25271784685136867
DSC2: Information Gain = 0.2526880117367696
EGFR-AS1: Information Gain = 0.252542042113169
DUSP2: Information Gain = 0.2524768358181293
MLKL: Information Gain = 0.2524625508062326
C21orf58: Information Gain = 0.25242126871876747
CRYBG3: Information Gain = 0.25226409025065344
POLE2: Information Gain = 0.2520793748143195
STX3: Information Gain = 0.2520580764428526
LERFS: Information Gain = 0.25198471366165176
EXOG: Information Gain = 0.2519532656282759
TOP2A: Information Gain = 0.25185928565477056
PLBD1-AS1: Information Gain = 0.25185860525908454
NAV1: Information Gain = 0.251856415211378
ATP6V1G1: Information Gain = 0.25182947324980565
TK1: Information Gain = 0.2518265705579086
CFAP251: Information Gain = 0.2517743816455025
TPTE2: Information Gain = 0.2516430162997796
CAVIN2: Information Gain = 0.2515064715935016
KRT19: Information Gain = 0.2514873694041604
CLEC3A: Information Gain = 0.25139535782468236
RELN: Information Gain = 0.2513713502378456
EGR3: Information Gain = 0.2513295198801686
HMGN3: Information Gain = 0.25129694953908377
HES2: Information Gain = 0.25120546093347884
DUSP8: Information Gain = 0.2511739822704
KIF5B: Information Gain = 0.25105445078291755
MCM6: Information Gain = 0.25094091578110245
HOXA10-AS: Information Gain = 0.2509402499128883
EFEMP2: Information Gain = 0.25091698280952546
CALR4P: Information Gain = 0.25086118345482156
DNER: Information Gain = 0.2508488983641215
BMF-AS1: Information Gain = 0.25082371651556823
GAPDHP68: Information Gain = 0.2507564096698707
SERPINE2: Information Gain = 0.2507188445182167
FBP1: Information Gain = 0.25068640030406875
BMS1P10: Information Gain = 0.25063886848766503
KRT18P46: Information Gain = 0.25056640668990493
MMP13: Information Gain = 0.25055276210213284
GAPDHP32: Information Gain = 0.2505192997886956
ADAMTS9-AS2: Information Gain = 0.2503647258807713
KBTBD2: Information Gain = 0.2503367570449522
SERTAD2: Information Gain = 0.2503324564654821
RGS20: Information Gain = 0.25031113823882767
C2CD2: Information Gain = 0.2502969663194057
MIR7113: Information Gain = 0.25026484094388346
PPP1R3E: Information Gain = 0.25019572808285795
ARID3A: Information Gain = 0.25005688482750843
ERICH6-AS1: Information Gain = 0.24992737452815517
STAG3: Information Gain = 0.24986176050500886
RAMP2: Information Gain = 0.24979395299735563
LRP4-AS1: Information Gain = 0.24978877156021206
GPR139: Information Gain = 0.24978287519963582
SYNE3: Information Gain = 0.2497686320343837
CPA6: Information Gain = 0.2496866903015571
GLRA3: Information Gain = 0.24951886232946818
ERLNC1: Information Gain = 0.2495002502945829
EEF1A1P13: Information Gain = 0.24935505458842488
WSCD1: Information Gain = 0.24933922253041718
PTTG1IP: Information Gain = 0.2491458001336242
SDK1-AS1: Information Gain = 0.249044509543362
FLOT2: Information Gain = 0.24892132963445324
MFSD11: Information Gain = 0.24889488091891554
TOX3: Information Gain = 0.24882300461383955
PLXNA2: Information Gain = 0.24877200147015777
TNNT1: Information Gain = 0.24869560962514337
PHLDB2: Information Gain = 0.24866869026688798
LIN7A: Information Gain = 0.248622745440084
IDS: Information Gain = 0.248599739920095
ANXA3: Information Gain = 0.24856346230153847
SCGB2A1: Information Gain = 0.24854435500586436
DHX40: Information Gain = 0.24847001656476397
GLIDR: Information Gain = 0.2484643202850607
IL17RB: Information Gain = 0.2483320438636627
KRT16: Information Gain = 0.2483029630227287
ANK2: Information Gain = 0.24827561277898758
CHAF1B: Information Gain = 0.24825734852735426
ZMAT4: Information Gain = 0.24822845844753538
CYB5B: Information Gain = 0.24815341814701353
SRD5A3-AS1: Information Gain = 0.24814017995767546
SLC47A1: Information Gain = 0.24808639786792197
SPA17: Information Gain = 0.2480627202086385
LRP2: Information Gain = 0.2480354338882762
ACTG1P12: Information Gain = 0.24792471921538106
SMIM15: Information Gain = 0.24792055052278839
NAXE: Information Gain = 0.24789529673023214
ZNF524: Information Gain = 0.24786576265489635
THEG: Information Gain = 0.24786164775243602
RANGRF: Information Gain = 0.2478589861653362
FNDC10: Information Gain = 0.24784370604918116
ISOC1: Information Gain = 0.24780862974264872
TRIM16L: Information Gain = 0.24779957732344893
GPRC5A: Information Gain = 0.24773820089944776
MID1: Information Gain = 0.24769986799681454
ERRFI1: Information Gain = 0.24767831237714555
CCDC71: Information Gain = 0.24762081388256418
MLEC: Information Gain = 0.2476188497069094
TONSL: Information Gain = 0.24758223037283633
CCR3: Information Gain = 0.24757386624598676
COL9A2: Information Gain = 0.24753187407415655
C1QTNF6: Information Gain = 0.2474992541739427
COL17A1: Information Gain = 0.2474431866391309
TM7SF2: Information Gain = 0.24731925279566358
SYNGR3: Information Gain = 0.24731825303892374
KHDC1: Information Gain = 0.24729100391234016
RGS17: Information Gain = 0.24727714177218596
C1R: Information Gain = 0.2471836493274846
ACSS1: Information Gain = 0.24715593668601432
TENM3-AS1: Information Gain = 0.24715310820720826
SERINC1: Information Gain = 0.24712296028929415
LINC01659: Information Gain = 0.2470604359243609
FOXRED1: Information Gain = 0.2470452640735621
MUC12-AS1: Information Gain = 0.24702283750942833
FTH1P7: Information Gain = 0.24691272274668097
HERC3: Information Gain = 0.24689651516580557
TATDN1P1: Information Gain = 0.24686701732789262
KRT17: Information Gain = 0.24683117474196248
NUAK1: Information Gain = 0.24682772877741788
PGLYRP2: Information Gain = 0.24677170482703503
MCUB: Information Gain = 0.24674147099496557
MYORG: Information Gain = 0.24660831325584565
ACTR3C: Information Gain = 0.24648125393082743
TMCC3: Information Gain = 0.24635959829949616
NPY1R: Information Gain = 0.24622830701972265
LRRC45: Information Gain = 0.24617243115157184
BLNK: Information Gain = 0.24616957706747855
NAMPTP1: Information Gain = 0.24615151950972147
MIR3917: Information Gain = 0.24608911970121738
CSTF3: Information Gain = 0.2460340080491099
FOXP2: Information Gain = 0.24601708134383204
FOXI3: Information Gain = 0.2459613487242629
GAPDHP44: Information Gain = 0.24590084654076105
YPEL5: Information Gain = 0.24584465255417287
RN7SL1: Information Gain = 0.24575169372648964
PRKAA2: Information Gain = 0.24567323599873658
SPATA12: Information Gain = 0.24545635496684493
PTPRR: Information Gain = 0.24545507119223275
COQ4: Information Gain = 0.24542226662594802
DPCD: Information Gain = 0.24536687629178555
CCND3: Information Gain = 0.24523069900878802
ARHGEF28: Information Gain = 0.24512182233871127
MKRN4P: Information Gain = 0.24506969164666748
TMEM45B: Information Gain = 0.24504738461367914
ATP6AP1L: Information Gain = 0.2449908102859597
MIR6819: Information Gain = 0.24496471188852875
FTH1P8: Information Gain = 0.24494051297494113
SBK1: Information Gain = 0.2449287834449423
SUOX: Information Gain = 0.24491499544640982
MEAF6: Information Gain = 0.24487018338882405
MAGEF1: Information Gain = 0.2448694827098863
ATP5MG: Information Gain = 0.244854807562914
RBP7: Information Gain = 0.24474188145401854
MAB21L3: Information Gain = 0.2446786504243803
GALR2: Information Gain = 0.2446604468068374
WASF4P: Information Gain = 0.24462178582580263
ARL6IP1P2: Information Gain = 0.2446019633679959
SARS1: Information Gain = 0.2445966725363582
MIR6811: Information Gain = 0.24457768821298687
ZNF766: Information Gain = 0.24452335544904957
DOCK11: Information Gain = 0.24448815848220473
CHST14: Information Gain = 0.24442249999217736
NUDT6: Information Gain = 0.2444129995798876
ECI1: Information Gain = 0.24431868934838485
SOWAHC: Information Gain = 0.2443178084602311
TOMM40P2: Information Gain = 0.24424363377821945
SEPHS1P4: Information Gain = 0.24414518497759063
RPS12P26: Information Gain = 0.2441150880535503
HSPB1P2: Information Gain = 0.2440886171984169
LONRF2: Information Gain = 0.24406980012797352
THEMIS2: Information Gain = 0.24406636917219582
CNPY4: Information Gain = 0.24398523653229964
DTYMK: Information Gain = 0.243983459122306
ABCB8: Information Gain = 0.2439529536335583
TMEM132B: Information Gain = 0.24388615213609666
HS6ST3: Information Gain = 0.2438318665213861
SOD2-OT1: Information Gain = 0.24382830060664773
ID2-AS1: Information Gain = 0.24378743599418717
ETV6: Information Gain = 0.2437093602035718
CCDC74B: Information Gain = 0.24366562690877447
DPT: Information Gain = 0.24365950761627309
CSGALNACT1: Information Gain = 0.24365153582849053
KCNN1: Information Gain = 0.24355700311347728
ZNF70: Information Gain = 0.2435468639997509
TIGD3: Information Gain = 0.24351609671637076
RHPN1-AS1: Information Gain = 0.2434849233533931
MALRD1: Information Gain = 0.24347610421096677
KRT89P: Information Gain = 0.2434514779810415
DACT3-AS1: Information Gain = 0.2434238135052158
PPP1R3B: Information Gain = 0.24342353421713603
CHAC1: Information Gain = 0.24338906798077486
ATG14: Information Gain = 0.2433754042396843
SEPSECS-AS1: Information Gain = 0.24335422852013555
ARHGEF35-AS1: Information Gain = 0.24328824624722278
IL17D: Information Gain = 0.24319964545901795
STMN4: Information Gain = 0.2431991000098015
DEPDC4: Information Gain = 0.2431570299167174
GINS1: Information Gain = 0.2431060153208553
MRTFA: Information Gain = 0.24294793064351428
MUC5B-AS1: Information Gain = 0.24293735805468253
LRG1: Information Gain = 0.24293615068394625
AXL: Information Gain = 0.2429185410625616
MCOLN3: Information Gain = 0.24289864768455582
OR2A9P: Information Gain = 0.24271949200042364
TNFRSF10B: Information Gain = 0.2427188107048095
MELTF: Information Gain = 0.24269685592380075
PTH1R: Information Gain = 0.24263054399885986
ZNF264: Information Gain = 0.24258303803765902
RTL8B: Information Gain = 0.242565986585922
MIR6830: Information Gain = 0.24253533227578883
DTNA: Information Gain = 0.24249583638201577
PKD1P6: Information Gain = 0.24249092986577114
OPLAH: Information Gain = 0.2424595925470232
FGD2: Information Gain = 0.24241172840572633
SUMO3: Information Gain = 0.24237865328380437
IGHE: Information Gain = 0.24237094691127625
ANXA2: Information Gain = 0.24236232697644589
CDYL: Information Gain = 0.24232496351526622
LINC01615: Information Gain = 0.2423067613893226
MRPL12: Information Gain = 0.24229740688538448
ASPM: Information Gain = 0.24227833832343482
CDC6: Information Gain = 0.24225926665669584
GTSE1: Information Gain = 0.24223185585656193
IFNAR2: Information Gain = 0.24222975143953174
FAS: Information Gain = 0.24222944690931514
UMODL1: Information Gain = 0.2421623236136552
SH3RF2: Information Gain = 0.2421504501205698
DIPK2A: Information Gain = 0.24206574024666505
E2F1: Information Gain = 0.24205173055049678
CORO1C: Information Gain = 0.24203233926590917
CDC42EP2: Information Gain = 0.24202508753041063
RUNX2: Information Gain = 0.24201327753927648
CCL22: Information Gain = 0.24198715109136626
MDK: Information Gain = 0.2419195907355769
MIR4743: Information Gain = 0.24190335861019197
GRPEL2: Information Gain = 0.24188785003571378
PALM2AKAP2: Information Gain = 0.24187499308519467
RAB37: Information Gain = 0.24186648415019896
SVIL: Information Gain = 0.2418331584876361
MAP7D2: Information Gain = 0.24170269658723686
PPP2CA-DT: Information Gain = 0.24164272755544647
NAGS: Information Gain = 0.24156920417340055
EMID1: Information Gain = 0.24147075980024657
C1QTNF7-AS1: Information Gain = 0.24143564829260744
GREB1: Information Gain = 0.24139407259531875
RNF41: Information Gain = 0.24137902632608998
NUDT1: Information Gain = 0.24136036996836752
SOX11: Information Gain = 0.24133374573439026
IFRD1: Information Gain = 0.2413154143743652
PPP1CB: Information Gain = 0.2412922568833804
CDH11: Information Gain = 0.24123048707862194
MIR761: Information Gain = 0.241229988893519
ZBTB20-AS1: Information Gain = 0.24121089680672636
ZDHHC9: Information Gain = 0.24120986094820585
PDGFC: Information Gain = 0.24114249165642487
ADPRH: Information Gain = 0.24113212530093908
CPLANE2: Information Gain = 0.24109235118591266
RNU6-8: Information Gain = 0.24103300019442853
CYBA: Information Gain = 0.24102782298878678
TMCO3: Information Gain = 0.24092495913103162
RFX3-AS1: Information Gain = 0.2408675259990367
S1PR5: Information Gain = 0.2408065355440523
PKD2: Information Gain = 0.24074506622407688
FTH1P11: Information Gain = 0.24072065789581298
GOLGA2P5: Information Gain = 0.24070495979656403
ZNF610: Information Gain = 0.2406276083824883
MIR3198-2: Information Gain = 0.2405905291450514
DSCAM: Information Gain = 0.24055367914927372
SMARCE1P5: Information Gain = 0.24052738229100568
LIF: Information Gain = 0.240447126997833
CAVIN2-AS1: Information Gain = 0.24043490168646597
LINC00526: Information Gain = 0.2404319632938534
CHML: Information Gain = 0.24037383537933343
SPTBN4: Information Gain = 0.2403107525131638
LINC00598: Information Gain = 0.24022177729758654
LNC-LBCS: Information Gain = 0.24013693661200297
C12orf60: Information Gain = 0.2400221372919782
CLGN: Information Gain = 0.23998849951225676
ARL2BPP4: Information Gain = 0.23996953882175265
KCTD11: Information Gain = 0.23993501502118542
CXCR4: Information Gain = 0.23992450525135367
ASPH: Information Gain = 0.23989643415348194
KIF4A: Information Gain = 0.23987853784208468
SKA3: Information Gain = 0.23981006240540315
HS3ST1: Information Gain = 0.23979672347933323
C19orf38: Information Gain = 0.23978956421965947
GRIN2C: Information Gain = 0.2397878943684082
CDKL2: Information Gain = 0.2397734154710771
SPRR1B: Information Gain = 0.23970548838346883
CENPX: Information Gain = 0.2396402433608844
DRAIC: Information Gain = 0.23961612969445145
NCMAP-DT: Information Gain = 0.23959014121124222
PAOX: Information Gain = 0.23952058375573304
YBX2: Information Gain = 0.23947493108653517
SEPTIN11: Information Gain = 0.23947387449057045
FCHO2-DT: Information Gain = 0.23935894083917875
LNX2: Information Gain = 0.2393201252301993
ZRANB1: Information Gain = 0.23928992717933162
NEK9: Information Gain = 0.23925995636978103
CEP19: Information Gain = 0.23916674893010237
LPAR3: Information Gain = 0.23911807907166227
NR3C1: Information Gain = 0.2390874908127547
WEE2: Information Gain = 0.23907931898054247
STMN1: Information Gain = 0.23905657335783292
OTOS: Information Gain = 0.23903516669235736
MIF4GD: Information Gain = 0.23898345947423727
NPEPPSP1: Information Gain = 0.23898275737812424
FAM177B: Information Gain = 0.23897891830996754
SIPA1L2: Information Gain = 0.23896821746677133
TMEM105: Information Gain = 0.23895927415153184
LINC02889: Information Gain = 0.23894434107541174
ANKRD22: Information Gain = 0.23894130574983463
PXDC1: Information Gain = 0.23892327058294427
GAMT: Information Gain = 0.23891690179067115
ISM2: Information Gain = 0.23884457954500116
TMPRSS9: Information Gain = 0.23881978368042756
FTH1P2: Information Gain = 0.23880428629135309
ARHGEF34P: Information Gain = 0.23872530215740007
GDAP1: Information Gain = 0.23866152616604008
NF2: Information Gain = 0.23857938739071916
SPRED1: Information Gain = 0.23848852074389537
BTC: Information Gain = 0.23846497956561263
TRIM60P18: Information Gain = 0.23840118963960855
MEX3D: Information Gain = 0.23834088404862852
IFI16: Information Gain = 0.23830656094966685
GDPD3: Information Gain = 0.23829516021764463
NAV2: Information Gain = 0.2382463719009229
MIR636: Information Gain = 0.23820452275806914
HSD17B14: Information Gain = 0.23815660851975884
CLPSL1: Information Gain = 0.23814587932737608
KCNJ8: Information Gain = 0.238132787204677
GSC: Information Gain = 0.23812298905882634
PCAT7: Information Gain = 0.2381218509683869
LINC00636: Information Gain = 0.23809906653897817
PRRC1: Information Gain = 0.23807386989319235
HSH2D: Information Gain = 0.23806249603727725
TIMELESS: Information Gain = 0.23805549167841145
CREB5: Information Gain = 0.23800618352892355
TRAV18: Information Gain = 0.2379902999503103
PHC2-AS1: Information Gain = 0.23797257375761527
PTGFRN: Information Gain = 0.23796976887233812
PRELID1: Information Gain = 0.2379297598248329
SEMA6C: Information Gain = 0.23777863335181104
PAG1: Information Gain = 0.23777382195360475
OR7E39P: Information Gain = 0.23777246495086168
GLT1D1: Information Gain = 0.23776055683180486
AGBL2: Information Gain = 0.23774843491220343
FAM178B: Information Gain = 0.23774283925390538
ST13P6: Information Gain = 0.23772827417015363
LHX2: Information Gain = 0.2376805312321537
ZNNT1: Information Gain = 0.23760024400805424
HSPB1P1: Information Gain = 0.23759926272158194
CORO1A-AS1: Information Gain = 0.2375602659481697
THRIL: Information Gain = 0.23754919867793478
SNRPGP15: Information Gain = 0.23753792983117394
C2CD4C: Information Gain = 0.23751637677693038
DDX59: Information Gain = 0.23751408944362318
NPY5R: Information Gain = 0.23751311003622644
FYB2: Information Gain = 0.23749218699302888
MAP1A: Information Gain = 0.2374831447535506
COL13A1: Information Gain = 0.23748212122914336
ID4: Information Gain = 0.2374450569784332
IL12A-AS1: Information Gain = 0.23743069670453387
TAGAP-AS1: Information Gain = 0.2373985740200193
LINC00824: Information Gain = 0.23738477943777125
GOLGA5: Information Gain = 0.23737153318966442
GCNT3: Information Gain = 0.23736754743153932
OR7E126P: Information Gain = 0.23736094808905883
FDX2: Information Gain = 0.23734191586779851
KCTD17: Information Gain = 0.23731800638533262
PRICKLE2-DT: Information Gain = 0.23729321730140063
GBX2: Information Gain = 0.23725309092781877
EDARADD: Information Gain = 0.23722195452920602
IL20: Information Gain = 0.23720942100021558
FAM230I: Information Gain = 0.23720322351636125
MIR6785: Information Gain = 0.23719219093372335
RPL7P6: Information Gain = 0.23718146987514555
NUSAP1: Information Gain = 0.2371367562106823
CMKLR2: Information Gain = 0.23712266517254932
LRRC3: Information Gain = 0.2371001749770143
MAF: Information Gain = 0.2370461363178531
C14orf132: Information Gain = 0.23702971688368235
TNIK: Information Gain = 0.23700778159979508
DINOL: Information Gain = 0.23700376793126066
DNAH10OS: Information Gain = 0.23700207320299338
ARIH1: Information Gain = 0.23699417466945505
FGF13: Information Gain = 0.23697182300404918
RPL7P47: Information Gain = 0.23692857407734813
SWAP70: Information Gain = 0.23673224350368738
HS6ST2: Information Gain = 0.2367183397206014
LINC01977: Information Gain = 0.2366813629778881
LINC00629: Information Gain = 0.23667889747847792
LINC00866: Information Gain = 0.23667436652482987
MIR6765: Information Gain = 0.23665679150495955
ZNF304: Information Gain = 0.23665578251731723
PEX5: Information Gain = 0.23663812325405775
THRSP: Information Gain = 0.2365808727175831
FTH1P5: Information Gain = 0.23656997860861484
CDKN1A: Information Gain = 0.23650485411983602
STAB1: Information Gain = 0.2364818275796634
PHGDH: Information Gain = 0.23648096360710835
LINC01340: Information Gain = 0.23646319387558923
MCM7: Information Gain = 0.23645408094351605
ALOX5: Information Gain = 0.23642469020459345
ZMYM5: Information Gain = 0.2364217974831211
DCLK2: Information Gain = 0.2364194873469705
ECPAS: Information Gain = 0.23641657123361748
ABHD4: Information Gain = 0.23637866115833006
RPL4P6: Information Gain = 0.2363036868995687
FGFR4: Information Gain = 0.2362851261834915
KLKP1: Information Gain = 0.23628109535672048
SUMO2P17: Information Gain = 0.23625510052834864
ARHGAP22: Information Gain = 0.23623946532772178
P4HA3-AS1: Information Gain = 0.23617976504243576
SCGB1D2: Information Gain = 0.23615145338248777
SPATA6: Information Gain = 0.23614821891595095
SMU1P1: Information Gain = 0.23611168508326363
RSL1D1: Information Gain = 0.2361066542375616
ZNF460: Information Gain = 0.23608134312003792
MIDEAS: Information Gain = 0.2360607398333634
SND1-IT1: Information Gain = 0.23605486123649144
ACKR2: Information Gain = 0.23602877185842663
SUMO2P21: Information Gain = 0.23595731353440552
ANKRD34A: Information Gain = 0.23593691851440757
CAD: Information Gain = 0.23591446181258768
ZMAT1: Information Gain = 0.23588003890349096
TDRD12: Information Gain = 0.2358469247037367
TRBV30: Information Gain = 0.23584247285696702
RAC3: Information Gain = 0.2358360565867108
SULT2B1: Information Gain = 0.23582814129391894
C11orf98: Information Gain = 0.23582718221021848
ZNF841: Information Gain = 0.2358014367939365
P3H2: Information Gain = 0.23578790584651532
GJB5: Information Gain = 0.2357723218463399
SNAP91: Information Gain = 0.23571822615296179
HDLBP: Information Gain = 0.23571571294198534
NQO2-AS1: Information Gain = 0.2355932218401282
ANKRD1: Information Gain = 0.23555654903596346
CCDC80: Information Gain = 0.23553204210950018
KY: Information Gain = 0.23549463636774814
SPINK8: Information Gain = 0.2354582553776967
IL6R: Information Gain = 0.23544375224590008
PCDH20: Information Gain = 0.23542105346883746
ACTG1P20: Information Gain = 0.23535656233267455
RBP1: Information Gain = 0.23530991647445965
SPTLC3: Information Gain = 0.23530633626513753
GAPDHP38: Information Gain = 0.2352794753465186
OIP5: Information Gain = 0.23526578168307188
DNAJB6P2: Information Gain = 0.23524576603003844
SERPINB5: Information Gain = 0.23519217480665922
DHRS7: Information Gain = 0.2351873236701587
ESCO2: Information Gain = 0.2351739514749529
MIR4737: Information Gain = 0.23516785134824114
GATA5: Information Gain = 0.235154459441397
NCAPH: Information Gain = 0.23512567882719893
CLSPN: Information Gain = 0.23510814502237776
MIR6833: Information Gain = 0.2351052539645786
PPP2R2A: Information Gain = 0.23507915515739297
MIR4428: Information Gain = 0.2350300007826278
CDH13: Information Gain = 0.2350221825968586
GAPDH-DT: Information Gain = 0.23500945620925529
RNF157: Information Gain = 0.23500545349828772
GJA3: Information Gain = 0.2349905071603775
TMTC1: Information Gain = 0.23498078656628785
ZNF853: Information Gain = 0.23494329950921777
GATA2-AS1: Information Gain = 0.23490809800577184
ATAD5: Information Gain = 0.2348975652410843
MIR4793: Information Gain = 0.23489316102954616
ZNF710: Information Gain = 0.2348831226606407
COL4A3: Information Gain = 0.2347762367264199
FTH1P10: Information Gain = 0.23477602905858386
PPFIBP2: Information Gain = 0.23476177016773625
TMPRSS13: Information Gain = 0.23474004393427905
AFAP1-AS1: Information Gain = 0.23473132799331253
NEK2: Information Gain = 0.23468818933868718
ANK1: Information Gain = 0.2346605572180216
SNORD35B: Information Gain = 0.234659677226716
BTG3-AS1: Information Gain = 0.23463223243992215
MIR6730: Information Gain = 0.23461925006747086
BMP6: Information Gain = 0.23457173426443045
ZDHHC11B: Information Gain = 0.2345582194555198
MARK3: Information Gain = 0.23453120878207168
NCOR2: Information Gain = 0.23451631114332994
CALM2P2: Information Gain = 0.2345034363982481
ADAM20P1: Information Gain = 0.2344745496315921
IL18: Information Gain = 0.23444633658426217
SCHLAP1: Information Gain = 0.2343876237319984
CDH16: Information Gain = 0.2343726407828981
ZBTB20: Information Gain = 0.23434826657879393
LINC02343: Information Gain = 0.23433802685512095
ZNF697: Information Gain = 0.2341827352160084
OXER1: Information Gain = 0.23417849468199115
CCDC148-AS1: Information Gain = 0.2341705768194533
EIF2S2P3: Information Gain = 0.23415706648183066
ZNF654: Information Gain = 0.23413596586315832
KLHDC8B: Information Gain = 0.23409236043410808
EN2: Information Gain = 0.23406175589778289
EFNB1: Information Gain = 0.23402197673087644
ALDOC: Information Gain = 0.23399372701590293
HGH1: Information Gain = 0.23392852117873608
SNORD69: Information Gain = 0.23390800063854722
INTS4P1: Information Gain = 0.2338783188774325
NDUFB8P2: Information Gain = 0.23378741775522194
NBEAP5: Information Gain = 0.23374895921749395
MBOAT7: Information Gain = 0.23373229807129725
ACSBG1: Information Gain = 0.2337180421091536
LINC01016: Information Gain = 0.23371112463784782
EIF4H: Information Gain = 0.23370231910041062
LINC01529: Information Gain = 0.23367982063349602
FGD3: Information Gain = 0.23365611534501785
FAM83G: Information Gain = 0.2335551356484875
RRAS: Information Gain = 0.23355231840434998
STX17-DT: Information Gain = 0.2335406068507735
UBASH3B: Information Gain = 0.23352974466419285
CCDC137: Information Gain = 0.23350652021129936
HLF: Information Gain = 0.23349848866088485
PPP1R9A: Information Gain = 0.23349440330080684
IRF2-DT: Information Gain = 0.2333809392049604
CAPN8: Information Gain = 0.23336648475071864
DLX5: Information Gain = 0.23333269377863286
PTGES: Information Gain = 0.2333086555652979
KCNIP4: Information Gain = 0.23329945689950327
OXR1-AS1: Information Gain = 0.23328325183729448
LHX6: Information Gain = 0.23327352226642972
PIGW: Information Gain = 0.2332446083488866
VN1R48P: Information Gain = 0.23321735376362973
MIR6865: Information Gain = 0.23319675239242144
FEM1B: Information Gain = 0.23319671170696443
EMILIN3: Information Gain = 0.2331913137297872
MIR4640: Information Gain = 0.23315423632717547
IL17C: Information Gain = 0.23309380249306555
MIR6866: Information Gain = 0.23306262443230308
RNF122: Information Gain = 0.23298730807710188
LINC02656: Information Gain = 0.2329709496873833
ZNF295-AS1: Information Gain = 0.232966130161665
SLC25A5: Information Gain = 0.2329449371433967
CCDC175: Information Gain = 0.23293304214092014
C7orf61: Information Gain = 0.23285278822323763
RASGEF1C: Information Gain = 0.2328295211424114
ABCC4: Information Gain = 0.23281774176640968
EMP1: Information Gain = 0.232816567738523
CACNA1C: Information Gain = 0.23279890471247433
FBXL7: Information Gain = 0.2327874470768816
TFF2: Information Gain = 0.23278707820711264
SRD5A3: Information Gain = 0.23275601543143143
KRT87P: Information Gain = 0.23274842365349957
PLEKHB1: Information Gain = 0.23272064049653074
MANCR: Information Gain = 0.23270259318008435
GCHFR: Information Gain = 0.23269121398471637
HBEGF: Information Gain = 0.2326535283469089
DMRT1: Information Gain = 0.23253757023935306
TOMM40P1: Information Gain = 0.23247253904359688
GPR132: Information Gain = 0.232458067931969
SNORD56: Information Gain = 0.23245679520410034
CNIH2: Information Gain = 0.23245052638634078
ALDH3A1: Information Gain = 0.23239599183341864
P2RX2: Information Gain = 0.23232648950449297
NKPD1: Information Gain = 0.23226547382185148
HEBP2: Information Gain = 0.232261418779806
S1PR4: Information Gain = 0.2322485154838838
PRAP1: Information Gain = 0.23224724146529607
PCSK5: Information Gain = 0.2322430028003648
EFCAB6-DT: Information Gain = 0.23223037082316322
GPAA1: Information Gain = 0.23222132118302308
MT-TS2: Information Gain = 0.23220516858278328
IRX4: Information Gain = 0.23217588539569478
GUCY2C: Information Gain = 0.23217581710357305
SORCS1: Information Gain = 0.23212525404585627
ZFP69B: Information Gain = 0.23210107983836314
OR7E36P: Information Gain = 0.2320763163044426
SLC4A8: Information Gain = 0.23199545715791192
LARGE2: Information Gain = 0.2319933300211745
RACGAP1: Information Gain = 0.2319765556466471
FAM83E: Information Gain = 0.2319397992466179
LAPTM5: Information Gain = 0.231930572741474
GABARAPL1: Information Gain = 0.23192454510472915
AFF3: Information Gain = 0.23189708853926883
KCNN3: Information Gain = 0.2318955630511017
SMPD5: Information Gain = 0.23169334194735525
OTOAP1: Information Gain = 0.2316768299242058
PPP1R14BP2: Information Gain = 0.23166161097503934
NEIL3: Information Gain = 0.23162061798097455
LINGO3: Information Gain = 0.23161261349599593
SPX: Information Gain = 0.23160753268229683
VCP: Information Gain = 0.2315938924680503
TMEM51-AS1: Information Gain = 0.23157210545311702
SMOC2: Information Gain = 0.2315269279483616
GATD3A: Information Gain = 0.2314951890054211
SFXN5: Information Gain = 0.23149098729416684
MIR6775: Information Gain = 0.23146620682255858
AGPAT4: Information Gain = 0.23146117763087837
ZNF333: Information Gain = 0.2314560201583995
CSRP2: Information Gain = 0.23140629739882113
NUGGC: Information Gain = 0.2314019797526865
RPL23AP49: Information Gain = 0.2313942614840634
ACRV1: Information Gain = 0.2313729340737185
ANTKMT: Information Gain = 0.23137206392080745
ATP6V1D: Information Gain = 0.23133762926655255
TCIRG1: Information Gain = 0.23131646941986395
CCDC87: Information Gain = 0.23124092017082543
NPIPB2: Information Gain = 0.2312298621470521
ELAC2: Information Gain = 0.23120505494305132
EIF4A1P5: Information Gain = 0.2312017818026566
KRT23: Information Gain = 0.2311784125083518
RACK1P1: Information Gain = 0.23117404483375936
MSLNL: Information Gain = 0.2311665218427632
HPGD: Information Gain = 0.23111044588020824
ADGRE2: Information Gain = 0.23108769521333272
USH1G: Information Gain = 0.23106856603183967
DLEU2L: Information Gain = 0.23106739102246787
SHLD1: Information Gain = 0.23105536014662276
EIF4BP5: Information Gain = 0.23105319052878603
TRPC6: Information Gain = 0.23100914381351956
SNORD62B: Information Gain = 0.2309993331914897
LINC01176: Information Gain = 0.23099913464757882
KCNJ3: Information Gain = 0.23099506880055198
CSF1: Information Gain = 0.2309821078313219
TSPAN13: Information Gain = 0.23093687043601463
CDKN2C: Information Gain = 0.2309138826057031
MASP1: Information Gain = 0.23087644840274457
MIR4751: Information Gain = 0.23086448889066213
PVRIG: Information Gain = 0.23086386616603582
LINC01164: Information Gain = 0.23085150908555563
FRG1HP: Information Gain = 0.23082259763193336
PLAGL1: Information Gain = 0.23080793212493678
CASC15: Information Gain = 0.23079779391664745
LCN2: Information Gain = 0.23079632403960382
PLA2G2A: Information Gain = 0.23075411006565294
THUMPD1P1: Information Gain = 0.23072701099064896
PLAAT4: Information Gain = 0.23071144201305827
RAB11FIP5: Information Gain = 0.23070240265800956
NDUFA13: Information Gain = 0.23065125946901088
NEDD9: Information Gain = 0.23061832660186554
NT5DC4: Information Gain = 0.2306160914738553
YWHAZP5: Information Gain = 0.23061237366965925
SOWAHA: Information Gain = 0.2305840846264151
PNMA6B: Information Gain = 0.2305245174031838
TRAV19: Information Gain = 0.23052345245181738
LKAAEAR1: Information Gain = 0.23050888465384012
ARMT1: Information Gain = 0.23050317901806827
LRRC10B: Information Gain = 0.23047049413026754
EEF1A1P22: Information Gain = 0.23046974239957163
LRAT: Information Gain = 0.23045169847899327
MARCKS: Information Gain = 0.2304424078168501
GCSHP5: Information Gain = 0.23042914897446742
SNORA10: Information Gain = 0.23042526573913125
CBR1: Information Gain = 0.23039629732986255
KRTAP5-1: Information Gain = 0.2303954155330603
MIR6891: Information Gain = 0.23036869714952735
DLGAP3: Information Gain = 0.230363909577177
FGR: Information Gain = 0.23035000672058903
GSTA4: Information Gain = 0.23034444997069325
C3: Information Gain = 0.23027123510516834
SOCS3-DT: Information Gain = 0.23026389160096494
PSPC1-AS2: Information Gain = 0.23024607334417024
ALDH1L1: Information Gain = 0.2302202557874462
DSG2-AS1: Information Gain = 0.23018186494890003
TNFSF4: Information Gain = 0.2301587652081203
WNT3: Information Gain = 0.23014054940928053
ZNF135: Information Gain = 0.23013528527109983
AMD1: Information Gain = 0.2301311492410425
FAM184A: Information Gain = 0.23012551555252836
SEC1P: Information Gain = 0.23009097530256573
NECTIN4: Information Gain = 0.23005318855418033
LINC00160: Information Gain = 0.23004866839005889
CR2: Information Gain = 0.23002674007733503
CD68: Information Gain = 0.23001967522129774
SFTPA2: Information Gain = 0.2299979308351403
SNORA77B: Information Gain = 0.22999419404023946
MAB21L4: Information Gain = 0.2299914859348775
CTAGE15: Information Gain = 0.22993586350473794
PLAC9P1: Information Gain = 0.22992279849585584
SLC8A1-AS1: Information Gain = 0.22989654122860226
ANKRD17-DT: Information Gain = 0.22988082153911749
TRIL: Information Gain = 0.22984279516678674
EGFLAM: Information Gain = 0.22983572870916524
MIR6741: Information Gain = 0.2298330448311312
TUBB1: Information Gain = 0.2298208632217431
KCNK12: Information Gain = 0.2298129346590203
RUNX2-AS1: Information Gain = 0.2298002478803418
CLMN: Information Gain = 0.2297991522081988
VEPH1: Information Gain = 0.22975130258451393
ATP5MF: Information Gain = 0.22972355208489725
LINC01714: Information Gain = 0.2297117617645852
TPBGL-AS1: Information Gain = 0.2296775134937108
ADH6: Information Gain = 0.22966996754391955
RGL1: Information Gain = 0.2296659552252147
CASC19: Information Gain = 0.22966240150290962
DNAH10: Information Gain = 0.22962121616853826
RN7SK: Information Gain = 0.22961958453211606
UBE2L4: Information Gain = 0.22961617940460588
ARMC7: Information Gain = 0.2295981215111269
ADGRG5: Information Gain = 0.22955321455626887
DLGAP4-AS1: Information Gain = 0.22954405406888356
PHETA2: Information Gain = 0.22947905594665863
APLP2: Information Gain = 0.22945430620343998
GATA4: Information Gain = 0.22944336555111322
GTF2IP7: Information Gain = 0.2294364963636233
LMCD1: Information Gain = 0.22943289307834758
SNF8: Information Gain = 0.22940427088806614
TTC9-DT: Information Gain = 0.22938428541842537
FGFBP3: Information Gain = 0.22937116277519376
FAM91A2P: Information Gain = 0.22936690567458728
CDK18: Information Gain = 0.22936049898035038
CLUHP10: Information Gain = 0.22931479636863927
SPINK14: Information Gain = 0.22931413335373585
PTPDC1: Information Gain = 0.22928463965749102
DTX4: Information Gain = 0.2292789544246685
GSTM3P2: Information Gain = 0.22926945172274338
LDHAP1: Information Gain = 0.22924470344376924
SNORA12: Information Gain = 0.2292423357682165
NTF4: Information Gain = 0.22919255449738163
GAPDHP52: Information Gain = 0.22915420297501443
NUS1P2: Information Gain = 0.22915199313409662
CCT5P1: Information Gain = 0.229150435937854
PRKCD: Information Gain = 0.22908472183607675
BHLHA15: Information Gain = 0.22907350277702498
RAET1L: Information Gain = 0.22904843864555868
LINC01732: Information Gain = 0.22903372454987636
PHC2: Information Gain = 0.22901479774564204
COLEC10: Information Gain = 0.22899778581053942
RASSF2: Information Gain = 0.22899495328182162
DSCC1: Information Gain = 0.22894491129439287
PGM5P2: Information Gain = 0.228857235247927
ATP5PDP4: Information Gain = 0.22883211917902946
TENT4A: Information Gain = 0.22882904545812788
PPIC: Information Gain = 0.22881323328832037
HAAO: Information Gain = 0.2287965490035444
FOXRED2: Information Gain = 0.2287943331359905
LINC01918: Information Gain = 0.22875583910135067
SYT5: Information Gain = 0.22872988324955146
LINC01290: Information Gain = 0.22872511998743472
POU2F2: Information Gain = 0.22868267694279898
KCNJ18: Information Gain = 0.2286652195504375
KIZ-AS1: Information Gain = 0.22864279808505783
MIR339: Information Gain = 0.22862925709944726
SVIL2P: Information Gain = 0.22860465991529155
APBA1: Information Gain = 0.22860272258759795
RETN: Information Gain = 0.2285986920152283
ZNF337-AS1: Information Gain = 0.22857619961581332
TMEFF1: Information Gain = 0.22857216333360642
LINC02716: Information Gain = 0.22854606378873954
SERPINE1: Information Gain = 0.22854051023410626
MYLK3: Information Gain = 0.22853421625801285
ANO1-AS1: Information Gain = 0.2285231053458523
DBF4B: Information Gain = 0.22851687315132163
ASRGL1: Information Gain = 0.22850274824531747
USP30: Information Gain = 0.22850064792349367
SNX25P1: Information Gain = 0.22848478324144206
CYYR1-AS1: Information Gain = 0.2284669539859867
ADAM20: Information Gain = 0.2284666623466043
CEACAM7: Information Gain = 0.22846132203645442
SMARCD2: Information Gain = 0.22845739649515573
FAT2: Information Gain = 0.22844982647949874
ZNF732: Information Gain = 0.22844556442145159
ASTL: Information Gain = 0.22844548381326457
FRMD6: Information Gain = 0.22842757787723666
TNFAIP3: Information Gain = 0.2284214688942432
TRAF6: Information Gain = 0.22838161966177806
C1RL: Information Gain = 0.2283815778243261
LINC02428: Information Gain = 0.22837262594463592
LINC00173: Information Gain = 0.22836762437775304
PLEKHA2: Information Gain = 0.22835522767117022
SPIN1: Information Gain = 0.2283486892995057
BMP1: Information Gain = 0.2283359008621757
LINC01275: Information Gain = 0.2283352403998522
PDE6D: Information Gain = 0.22833030943374477
ACSM3: Information Gain = 0.2283152441581895
FBXL4: Information Gain = 0.22829407958123427
VWA5A: Information Gain = 0.22829108499257766
SHANK3: Information Gain = 0.22826858750821621
KRT19P1: Information Gain = 0.22826266161390518
TUBAP2: Information Gain = 0.22821577669872362
RPS3AP27: Information Gain = 0.22821570628307386
SYNGR1: Information Gain = 0.22819741267002924
MED28-DT: Information Gain = 0.22818700145674353
MRAP: Information Gain = 0.22815952911094595
MT-TM: Information Gain = 0.22813948229938785
LINC01517: Information Gain = 0.22812486405465848
RLIMP1: Information Gain = 0.22809992146187197
ERVE-1: Information Gain = 0.22809215705527963
RNU6-438P: Information Gain = 0.22808056207894567
MEF2C: Information Gain = 0.22806679808077246
INTU: Information Gain = 0.2280632654274557
ZNF285B: Information Gain = 0.22805870307019194
STK19B: Information Gain = 0.22804827381487924
C6orf58: Information Gain = 0.2280389592173655
LINC02352: Information Gain = 0.22803018369139316
C21orf62-AS1: Information Gain = 0.22802103026748788
AP1B1: Information Gain = 0.22796184894562832
VPS13B-DT: Information Gain = 0.22794349509049106
IFIT2: Information Gain = 0.22791663784369764
KANK3: Information Gain = 0.22790854358490287
TTC9B: Information Gain = 0.2279046306887842
FAM171A1: Information Gain = 0.22786985637529256
CNN2P9: Information Gain = 0.22783807442070492
CCNO-DT: Information Gain = 0.2278319986967665
DHRS9: Information Gain = 0.22782809302275853
PSMG3: Information Gain = 0.22782238745867356
DSG1-AS1: Information Gain = 0.22780916855763178
HKDC1: Information Gain = 0.22780863837786747
PEG13: Information Gain = 0.22779498048381686
HAS2-AS1: Information Gain = 0.22779115137605666
NEU1: Information Gain = 0.2277703881039823
CLIP3: Information Gain = 0.22775238371347784
OR11H13P: Information Gain = 0.22775152765309303
CCR8: Information Gain = 0.2277311435587206
GP2: Information Gain = 0.22771394630184405
PLCL2-AS1: Information Gain = 0.2277032755274766
ZNF133-AS1: Information Gain = 0.2277015307249055
LTB4R2: Information Gain = 0.22768855471276184
SNTN: Information Gain = 0.2276819234964058
CHSY3: Information Gain = 0.22767332573317245
TBC1D24: Information Gain = 0.22762975050879097
TENM4: Information Gain = 0.22762536117268684
GALNT6: Information Gain = 0.22762169433300605
GAL3ST1: Information Gain = 0.22759000160390452
TIGD2: Information Gain = 0.22758298565986435
USP2-AS1: Information Gain = 0.22758288466494903
CYCSP38: Information Gain = 0.2275745909002942
MIR3064: Information Gain = 0.22756249608297896
NR4A3: Information Gain = 0.2275534132181698
LINC01132: Information Gain = 0.22754756028709244
CDA: Information Gain = 0.2275462689914507
ACVR1: Information Gain = 0.22753185600116677
CES5AP1: Information Gain = 0.22753144106744227
GRM1: Information Gain = 0.22749418050802972
SHMT1P1: Information Gain = 0.22747104223539139
RMI2: Information Gain = 0.2274708843735025
IL12A: Information Gain = 0.22745562998509583
ELL2P1: Information Gain = 0.22744901301426856
ABCC1: Information Gain = 0.22744535689747503
LCMT2: Information Gain = 0.22743047640061742
LINC00957: Information Gain = 0.2274188455966053
EPHA8: Information Gain = 0.22740618739162977
PDAP1: Information Gain = 0.2274020397084886
MRPS7: Information Gain = 0.22737187051922825
SNX31: Information Gain = 0.22735819840729454
IGFBP5: Information Gain = 0.22735177674425278
RPL35AP16: Information Gain = 0.2273331035619317
PCDH12: Information Gain = 0.22732971154114678
GRK6P1: Information Gain = 0.22732864766301453
UPK1B: Information Gain = 0.22730810946002866
GAPDHP26: Information Gain = 0.22730390322502436
AFAP1L1: Information Gain = 0.22729730626195854
RPS10P7: Information Gain = 0.22729495182864135
MARK3P3: Information Gain = 0.2272838939441013
MARCHF1: Information Gain = 0.22725490733314446
RFX3: Information Gain = 0.2272165479592092
HNRNPRP1: Information Gain = 0.22721376858723152
TENM3: Information Gain = 0.22720837987545606
GSG1: Information Gain = 0.22719568650836264
TRAPPC1: Information Gain = 0.22719431262756418
GAPDHP45: Information Gain = 0.22719082476852326
EIF1P3: Information Gain = 0.2271889071740265
RNU6-914P: Information Gain = 0.22718183660208968
PRDX3P1: Information Gain = 0.2271612580075002
CGNL1: Information Gain = 0.22714883245418527
TSPAN18: Information Gain = 0.22713928975921127
CHKB-DT: Information Gain = 0.22712640993158306
LBX2: Information Gain = 0.2271192479273212
DNAH3: Information Gain = 0.22708982408389367
PRR22: Information Gain = 0.2270703232959681
ATP4B: Information Gain = 0.22703650702820433
DNMT1: Information Gain = 0.22702306569582853
AKR1C3: Information Gain = 0.22699087944660779
LINC00705: Information Gain = 0.22697349481624696
CRHR2: Information Gain = 0.22697349481624696
MRPL23-AS1: Information Gain = 0.22695418650889598
MIR4658: Information Gain = 0.22691228944083908
CLIP2: Information Gain = 0.22689610340538402
RXRG: Information Gain = 0.22689274256481973
SNX18: Information Gain = 0.22687267772777764
GGT5: Information Gain = 0.22685342067566427
NEDD8: Information Gain = 0.2268518201655607
MIR6875: Information Gain = 0.22684950169879436
VGF: Information Gain = 0.22681178139835412
CCDC9B: Information Gain = 0.22680268337784049
NACA: Information Gain = 0.22678709483002457
AARS1: Information Gain = 0.22676983575144716
IGHG2: Information Gain = 0.2267623518692301
ZBTB32: Information Gain = 0.2267584860054883
DLL3: Information Gain = 0.22675735691001053
ZRANB2-AS2: Information Gain = 0.2267355565578264
LAMB2P1: Information Gain = 0.22672780928509528
HLA-J: Information Gain = 0.22670604850602838
DACH1: Information Gain = 0.22670545253488972
TOR3A: Information Gain = 0.22670418835066042
ICAM3: Information Gain = 0.22669970240701343
PFDN4: Information Gain = 0.22669431037635723
DUOX1: Information Gain = 0.22667840634549585
MPPED2: Information Gain = 0.22665568309145478
HABP2: Information Gain = 0.22664344789226254
NRAP: Information Gain = 0.22664344789226254
KAT6B: Information Gain = 0.22660842709695483
ENHO: Information Gain = 0.22660126075897713
GBAP1: Information Gain = 0.22658633569859066
ANGPT4: Information Gain = 0.2265561377341183
EBF3: Information Gain = 0.22655179922545887
MAPK6P4: Information Gain = 0.22653409845051486
MLXP1: Information Gain = 0.22650691050782235
GRIK5: Information Gain = 0.2265028746939317
ZMAT3: Information Gain = 0.22649566072391614
CEACAM8: Information Gain = 0.22649248117438958
SEMA6D: Information Gain = 0.22645885502924368
PDZK1P1: Information Gain = 0.22642971786729782
SMIM10L2B-AS1: Information Gain = 0.22642451242613992
GALNT5: Information Gain = 0.22642258154574924
LIPK: Information Gain = 0.22641701289332405
CICP4: Information Gain = 0.22640698845378648
AMER2: Information Gain = 0.22640589339035988
SPRY3: Information Gain = 0.2264049576619629
FAR2P2: Information Gain = 0.2263985983508503
FAM219A: Information Gain = 0.22639711175591004
ZFP2: Information Gain = 0.22639686338741516
DPF3: Information Gain = 0.22638908908417066
SCGB1B2P: Information Gain = 0.22637228308586033
PRDM11: Information Gain = 0.22634712085379238
RPL34P18: Information Gain = 0.22633084375036083
ADRB2: Information Gain = 0.22629490690215492
ACE: Information Gain = 0.22627316656825291
WNT11: Information Gain = 0.22625968631184823
LINC01143: Information Gain = 0.22621925416788136
KCND1: Information Gain = 0.22621855415589076
DENND5A: Information Gain = 0.22621485664629803
CNTNAP5: Information Gain = 0.22619860554215054
KIF20A: Information Gain = 0.2261913425987636
KNTC1: Information Gain = 0.22613616517355228
SNORD35A: Information Gain = 0.22613016625497107
UCA1: Information Gain = 0.22612051096401697
FEM1C: Information Gain = 0.22611044636517574
ERICH2: Information Gain = 0.22609200489285008
BRI3: Information Gain = 0.22604587559156974
TBX15: Information Gain = 0.22604527473332725
NEURL2: Information Gain = 0.2260092052910243
LCP2: Information Gain = 0.22600775965682263
KCTD21-AS1: Information Gain = 0.22598867586535665
POFUT2: Information Gain = 0.22598627904365753
UBA52P7: Information Gain = 0.22597724723079482
DSN1: Information Gain = 0.22593736620386706
RSRC2: Information Gain = 0.22592582520453242
PARP6: Information Gain = 0.22591244835177293
GOLGA6L4: Information Gain = 0.22585987765373616
RPL22P2: Information Gain = 0.2258563298921792
SEMA5B: Information Gain = 0.2258487492656709
HS3ST5: Information Gain = 0.22584424627690125
ABHD6: Information Gain = 0.2258378150095499
CSPG4P12: Information Gain = 0.2258360018119352
MVD: Information Gain = 0.225818193882376
SPEF1: Information Gain = 0.22581669393865234
ZBTB8OSP2: Information Gain = 0.22580600687585917
TIPARP: Information Gain = 0.2258039304974493
KIF18A: Information Gain = 0.225793881330286
CD2AP: Information Gain = 0.22578813190559677
MIR193A: Information Gain = 0.22577487031936583
SNTG2-AS1: Information Gain = 0.22576888348062907
POTEJ: Information Gain = 0.22575930605638206
TCIM: Information Gain = 0.2257503153907754
HCG4P8: Information Gain = 0.2257485323905728
GFI1: Information Gain = 0.22570000412270663
RNF165: Information Gain = 0.22569283128557416
SRA1: Information Gain = 0.22568535628360875
ZNF725P: Information Gain = 0.22566644877939868
PLA2G4F: Information Gain = 0.22564522580635948
TMEM156: Information Gain = 0.2256413299432718
FRG1EP: Information Gain = 0.22562728395669573
SHH: Information Gain = 0.22562379440663283
CD3E: Information Gain = 0.22560161105216947
LINC00501: Information Gain = 0.22559460916455842
ZNF723: Information Gain = 0.22558349132339672
FTH1P13: Information Gain = 0.2255589216124545
SCGB2A2: Information Gain = 0.22553542243190394
PCDHA4: Information Gain = 0.22553275185339716
FLT1: Information Gain = 0.22552985376771018
RASA4CP: Information Gain = 0.22549314520521535
SLITRK4: Information Gain = 0.22548833932817192
SDHDP6: Information Gain = 0.22547406390262736
SNORD117: Information Gain = 0.22545116968430023
SETP10: Information Gain = 0.22544972738785285
SNORA9: Information Gain = 0.22542240153389304
PDE6B: Information Gain = 0.22538863636529682
MAML2: Information Gain = 0.2253755902234167
HOTTIP: Information Gain = 0.2253736090571823
IFIT1: Information Gain = 0.22537305667278695
SYT3: Information Gain = 0.22535564999767232
PEX11G: Information Gain = 0.2253467516865837
WNT9A: Information Gain = 0.22534530593327706
LBP: Information Gain = 0.22534284204340027
PAFAH1B2P1: Information Gain = 0.22532973103345388
CNTN3: Information Gain = 0.22528457853993356
RCAN2: Information Gain = 0.22527033584070533
SEC62-AS1: Information Gain = 0.22526529769394177
DISP2: Information Gain = 0.2252622223761438
COX7A2P2: Information Gain = 0.22525483175028693
SIAH2-AS1: Information Gain = 0.22524075449389303
CKS1BP1: Information Gain = 0.22523539539884196
SPRY2: Information Gain = 0.22522541411757557
PC: Information Gain = 0.22522417101826897
MIR6814: Information Gain = 0.22521957311013763
OR51B5: Information Gain = 0.22521540761512182
NR3C2: Information Gain = 0.22518710691281818
ORC1: Information Gain = 0.22516466371877186
RPL12P13: Information Gain = 0.2251475749305043
SOWAHD: Information Gain = 0.22513441170626503
RPF2P1: Information Gain = 0.2251218445119152
FTH1P23: Information Gain = 0.22509199629335885
GAPDHP28: Information Gain = 0.22509142400035254
TSFM: Information Gain = 0.22508794892311212
PSMC5: Information Gain = 0.22508677666522448
ITGA2B: Information Gain = 0.22508555408704845
ZNF17: Information Gain = 0.2250688721009264
CCDC40: Information Gain = 0.22505262685973615
MIR6876: Information Gain = 0.22505106718675205
GLRX3P2: Information Gain = 0.22502183495385264
PTGER3: Information Gain = 0.22500315461377252
CREB3L2: Information Gain = 0.22500145566589058
SH3BP1: Information Gain = 0.225000596604771
FNDC4: Information Gain = 0.22499764947903422
TLE2: Information Gain = 0.22498618734807652
TGM1: Information Gain = 0.2249574575898059
PCDH8: Information Gain = 0.22494841406937982
PDZD2: Information Gain = 0.224929091222291
GTF3C6P2: Information Gain = 0.22492273955395214
UBE2CP4: Information Gain = 0.22491914192358342
ADCY7: Information Gain = 0.22491496679287293
VTN: Information Gain = 0.22491049963973309
LENG9: Information Gain = 0.2249100046177761
BNIP3P10: Information Gain = 0.22489929399975672
KIAA0930: Information Gain = 0.22488047073878525
FAUP2: Information Gain = 0.2248677238072061
CEMP1: Information Gain = 0.22486762153051654
ZC3H6: Information Gain = 0.2248586303949276
BNIP3P11: Information Gain = 0.22485690672312453
PDPN: Information Gain = 0.22482364171925684
CTNNA1P1: Information Gain = 0.22481906950685415
LY96: Information Gain = 0.2248136519447348
RPSAP14: Information Gain = 0.2247869464644885
WBP1LP2: Information Gain = 0.22473356013892798
RNU6-1055P: Information Gain = 0.22472541141619673
NIM1K: Information Gain = 0.22471945433407026
GPR87: Information Gain = 0.22471765399731547
MIR6510: Information Gain = 0.22468024055665725
RPL23AP8: Information Gain = 0.2246568612539377
MIR936: Information Gain = 0.22464747150583197
FZD9: Information Gain = 0.22463434556733408
ZNF74: Information Gain = 0.22462347796896465
USP8P1: Information Gain = 0.22461498868977015
KLB: Information Gain = 0.22461112184483278
KAT5: Information Gain = 0.22458576851011336
LINC01772: Information Gain = 0.22457190938955618
CLDND2: Information Gain = 0.22456922485683628
GPD1: Information Gain = 0.2245660318799867
ALDH2: Information Gain = 0.22454359735029517
TUFMP1: Information Gain = 0.22453735096733252
IRF1-AS1: Information Gain = 0.2245168967065445
GATA3-AS1: Information Gain = 0.22451072466360866
ANKRD49P2: Information Gain = 0.2244936333261467
ACACB: Information Gain = 0.22448438036138962
COL5A3: Information Gain = 0.22447406229690814
KCNMB1: Information Gain = 0.2244103975766465
RPL21P8: Information Gain = 0.22440428517129907
AGAP1-IT1: Information Gain = 0.22439010443446428
ZNF727: Information Gain = 0.22436672603563879
RPGRIP1: Information Gain = 0.22434974016347708
LINC00519: Information Gain = 0.22434577060661276
DSEL-AS1: Information Gain = 0.22434033444538537
PCDHAC1: Information Gain = 0.22432881742056132
MAP6: Information Gain = 0.22431405985225084
MYT1: Information Gain = 0.22429125125116545
MED10: Information Gain = 0.22428745944785589
PHF24: Information Gain = 0.22428099909452515
SLC30A6-DT: Information Gain = 0.22427906656585672
MMP1: Information Gain = 0.224269064147274
LINC02485: Information Gain = 0.22426789050499818
PGAM4: Information Gain = 0.22426471489326172
PITPNM3: Information Gain = 0.22425669199442733
AOX2P: Information Gain = 0.22425488274688177
RAET1E-AS1: Information Gain = 0.22422007668349364
LINC00323: Information Gain = 0.22419835976758895
SAV1: Information Gain = 0.22416015461134964
MRTFA-AS1: Information Gain = 0.22415746289935146
RNU6-436P: Information Gain = 0.22413098252457875
FBXO30-DT: Information Gain = 0.22412766691679398
PLCB2: Information Gain = 0.2241006486346262
PLEKHH2: Information Gain = 0.22407149849044905
RPL32P20: Information Gain = 0.22406380003571136
CNNM1: Information Gain = 0.2240365309759731
HECW2-AS1: Information Gain = 0.2240342339282666
HOPX: Information Gain = 0.22402758358792663
RPL17P36: Information Gain = 0.22402362399406095
RPL39P3: Information Gain = 0.22398811009599906
RASSF4: Information Gain = 0.22398105951040614
LINC01637: Information Gain = 0.22396875953406226
ZNF793: Information Gain = 0.2239634264600754
MIR6763: Information Gain = 0.22395602859440644
MMP2: Information Gain = 0.2239336576409816
LINC00365: Information Gain = 0.22393299226905872
ESR1: Information Gain = 0.22392655497283398
WNT5A-AS1: Information Gain = 0.2239231773394883
LINC01409: Information Gain = 0.22392144987971374
PTMAP12: Information Gain = 0.22391648198093939
KCTD12: Information Gain = 0.22390479825068077
TMEM171: Information Gain = 0.22389030392094145
RPL21P89: Information Gain = 0.2238889437415459
MCF2: Information Gain = 0.22386986003244336
LINC01094: Information Gain = 0.22386608979363398
KCNV2: Information Gain = 0.22385950045346314
OR1L8: Information Gain = 0.22385950045346314
RAMP2-AS1: Information Gain = 0.2238516927588834
PRSS3: Information Gain = 0.22384388228119945
SLAMF8: Information Gain = 0.22383744223794277
PDE4C: Information Gain = 0.22383449397190613
SLC17A5: Information Gain = 0.223825556852288
SEPTIN9-DT: Information Gain = 0.22381461621186416
SNCA: Information Gain = 0.22379868361360233
FOXI1: Information Gain = 0.2237823024125356
SMILR: Information Gain = 0.22376586199936876
PTPN21: Information Gain = 0.22374775789672774
EEF1A1P9: Information Gain = 0.22373872356552593
SMIM35: Information Gain = 0.22373852372919711
PCSK9: Information Gain = 0.2237358946038952
PTCHD3: Information Gain = 0.2237358946038952
SH3TC2-DT: Information Gain = 0.2237358946038952
CCDC106: Information Gain = 0.22370260486146054
CEL: Information Gain = 0.22369465402260924
TMEM230P1: Information Gain = 0.22369031712421594
S100A8: Information Gain = 0.2236851150660939
MT1E: Information Gain = 0.22368358358034435
GABARAPL3: Information Gain = 0.22368172186962187
RASSF10-DT: Information Gain = 0.22368124054146232
PTBP1P: Information Gain = 0.2236723483834242
PAICSP1: Information Gain = 0.22366634370986227
LINC00539: Information Gain = 0.2236598385358184
SCARNA12: Information Gain = 0.2236514802059537
DSG4: Information Gain = 0.22363517910156006
TCN1: Information Gain = 0.2236326518020011
ROR2: Information Gain = 0.22363017403009655
WDR62: Information Gain = 0.22361681155004431
LINC00276: Information Gain = 0.22359977295294997
USP54: Information Gain = 0.223573855941392
HNRNPM: Information Gain = 0.2235703198365835
EPX: Information Gain = 0.22356907813584592
IL2RG: Information Gain = 0.22356005075716512
TP73: Information Gain = 0.22355316355094756
PSMD10P2: Information Gain = 0.22354123920974178
LINC01152: Information Gain = 0.22353810979376854
NSG2: Information Gain = 0.22353810979376854
PRSS21: Information Gain = 0.22353810979376854
LINC00239: Information Gain = 0.2235316621027017
ZNF625: Information Gain = 0.22351737313437225
C1orf158: Information Gain = 0.22350457833804804
PSMB8: Information Gain = 0.22349754423291257
SNRPCP3: Information Gain = 0.22348872644850837
CD101: Information Gain = 0.22347511393290342
PBK: Information Gain = 0.22346589165294795
LINC01697: Information Gain = 0.223465041389179
NACAP2: Information Gain = 0.22346423301621443
SLC25A24P1: Information Gain = 0.22345097043718787
CDC42P5: Information Gain = 0.22344337787854984
MAST1: Information Gain = 0.2234333665628374
RPL7P44: Information Gain = 0.2234311121028345
LHFPL6: Information Gain = 0.22342757520484247
WWOX: Information Gain = 0.22342597629845584
RPS27AP6: Information Gain = 0.22342153556264144
RNA5SP260: Information Gain = 0.22341102626391973
CCL28: Information Gain = 0.2234060330129648
MIR583HG: Information Gain = 0.22340599987712784
IL6-AS1: Information Gain = 0.2234054813818509
C16orf86: Information Gain = 0.22338640638859686
MYO3B: Information Gain = 0.2233666766622071
ZXDB: Information Gain = 0.2233661126779063
CNGB1: Information Gain = 0.22334737837512297
TMSB10P1: Information Gain = 0.22332260495281187
OR2A42: Information Gain = 0.22331377933395768
MIR937: Information Gain = 0.2233054371729164
SLC25A38P1: Information Gain = 0.22328472993450688
IMPDH1P9: Information Gain = 0.22328058828398767
TMEM229A: Information Gain = 0.22327793900021264
KLHDC7B: Information Gain = 0.22326907150031894
MECP2: Information Gain = 0.22326118857987254
NAV2-AS2: Information Gain = 0.22325237939059916
C11orf94: Information Gain = 0.22323616883601916
MIR3654: Information Gain = 0.22323337706410062
ZNF804B: Information Gain = 0.22323330737608993
SH3BP4: Information Gain = 0.22323063465259807
MFF-DT: Information Gain = 0.22320311371472124
BRPF3-AS1: Information Gain = 0.223193706493944
PARVB: Information Gain = 0.2231919398546971
RDM1: Information Gain = 0.2231886161213814
LGALS1: Information Gain = 0.22318826380748824
SETP8: Information Gain = 0.22318708688210576
BHMT: Information Gain = 0.22318433091147916
MIX23P5: Information Gain = 0.22316732934717143
CCDC60: Information Gain = 0.22313278631859879
TBXA2R: Information Gain = 0.2231305297794559
LINC02157: Information Gain = 0.22309733900948614
LINC00115: Information Gain = 0.22308694112915228
HEATR4: Information Gain = 0.223084288100313
TPT1P6: Information Gain = 0.22307167833676478
CCDC17: Information Gain = 0.22306395883097285
IL17RD: Information Gain = 0.22306161663587543
ACTG1P15: Information Gain = 0.2230388903218974
LINC00894: Information Gain = 0.2230322926968593
DYRK3: Information Gain = 0.22302438033747252
RNF157-AS1: Information Gain = 0.22302376084481956
TTC3-AS1: Information Gain = 0.22301918681454524
RSAD2: Information Gain = 0.22301198715924264
RPL15P2: Information Gain = 0.22300658149052488
ANAPC10P1: Information Gain = 0.22300494581019792
MIF4GD-DT: Information Gain = 0.2229884017495336
ZBED6CL: Information Gain = 0.22297773857181147
MED14OS: Information Gain = 0.22296262437477732
PTMAP11: Information Gain = 0.22294952509312127
MZB1: Information Gain = 0.22294784886073282
RSKR: Information Gain = 0.22294754879185885
ZNF551: Information Gain = 0.22294611717876012
GPAT4-AS1: Information Gain = 0.22294376478483735
CKMT2: Information Gain = 0.22294070511609632
MIR3918: Information Gain = 0.2229390706913459
RSL24D1: Information Gain = 0.2229373956984424
SNX19P3: Information Gain = 0.2229342976229538
SQLE-DT: Information Gain = 0.2229331899246083
LINC01424: Information Gain = 0.22292821201801072
GPRC5D-AS1: Information Gain = 0.22292487493762891
SMCR5: Information Gain = 0.22292352105948665
GAPDHP2: Information Gain = 0.22291567990475358
ZNF702P: Information Gain = 0.22291125133807133
FKBP6: Information Gain = 0.222908156156437
LINC01535: Information Gain = 0.22289948439571639
TROAP: Information Gain = 0.22288866918779227
NAA20P1: Information Gain = 0.2228800244977458
EEF1DP8: Information Gain = 0.22287920160507646
SAP18P2: Information Gain = 0.22285897326944637
ZNF391: Information Gain = 0.2228344763519845
MIR27B: Information Gain = 0.22283182533609214
LINC01356: Information Gain = 0.22281207463921526
RPS2P24: Information Gain = 0.22277693794873588
KRT6A: Information Gain = 0.22277693794873588
TF: Information Gain = 0.22276994904329195
BIRC3: Information Gain = 0.22274767879057666
NOXO1: Information Gain = 0.22274724696824677
EPHX4: Information Gain = 0.22270507499301107
CPB2-AS1: Information Gain = 0.22270416567054663
FOXB1: Information Gain = 0.22270128496166608
CCDC184: Information Gain = 0.22268075224703976
DSCAML1: Information Gain = 0.22266116061060348
MIR7706: Information Gain = 0.222658206250139
LINC01892: Information Gain = 0.22265761412083074
MIR6746: Information Gain = 0.22265573217433077
IGSF9B: Information Gain = 0.2226379337305091
GLYCTK-AS1: Information Gain = 0.2226372416175102
GAB2: Information Gain = 0.22263651714613042
TAF7L: Information Gain = 0.22260849877009203
HMSD: Information Gain = 0.22260286609095914
GFRA3: Information Gain = 0.22260097738399875
PAEP: Information Gain = 0.22258747159566616
LINC01285: Information Gain = 0.22256730636063615
GSEC: Information Gain = 0.22256195756015118
IDSP1: Information Gain = 0.22256148655267038
HNF1A: Information Gain = 0.22255917367147293
PDZD4: Information Gain = 0.22255675030074906
F2R: Information Gain = 0.22250655782279338
MARCHF5: Information Gain = 0.22250223912603095
UNC93B3: Information Gain = 0.22249821281836502
FAM124A: Information Gain = 0.22248728595897815
ARMC10P1: Information Gain = 0.22247549665768207
SUGT1P4: Information Gain = 0.2224722148173437
CRYM: Information Gain = 0.22245248738172219
TAS2R31: Information Gain = 0.2224494659773737
ST13P15: Information Gain = 0.2224391254502467
ARL5AP5: Information Gain = 0.2224369651632765
PTP4A1P4: Information Gain = 0.22243452614788795
HS3ST3A1: Information Gain = 0.22242764399185
RNVU1-19: Information Gain = 0.22242268218893435
SV2C: Information Gain = 0.22240451806077233
SOHLH1: Information Gain = 0.22240451806077233
MAPRE2: Information Gain = 0.22239690162821546
ACTG2: Information Gain = 0.22237283921038875
SFMBT2: Information Gain = 0.2223697866760277
HYI: Information Gain = 0.2223417442935447
SCX: Information Gain = 0.22233544836447594
RPL24P2: Information Gain = 0.22231914946925135
PTX3: Information Gain = 0.22231663292685488
KIF21B: Information Gain = 0.222303983241958
MIR4434: Information Gain = 0.22229783889512356
CCNYL7: Information Gain = 0.22229454540333138
RPL7P8: Information Gain = 0.2222818759942513
RNA5SP221: Information Gain = 0.22226061475374803
LINC01425: Information Gain = 0.2222479742113883
CHRFAM7A: Information Gain = 0.22223621099270696
NHLRC1: Information Gain = 0.2222334428773145
WNT4: Information Gain = 0.22223101703581905
SF3B4P1: Information Gain = 0.22222075896697002
NBEAP6: Information Gain = 0.22221788135833287
RPSAP26: Information Gain = 0.2222144545220639
MIR215: Information Gain = 0.2221823764732278
MEX3B: Information Gain = 0.22218210741933642
LETR1: Information Gain = 0.22215655926850997
ZSCAN18: Information Gain = 0.2221483230839829
PRDM16: Information Gain = 0.22214610228317855
MAST3: Information Gain = 0.2221451324347805
EEF1A1P12: Information Gain = 0.22212929172816143
PRKG2: Information Gain = 0.2221263144493386
IL1R2: Information Gain = 0.2221209773359285
FANCE: Information Gain = 0.2221138314835598
CDH5: Information Gain = 0.22211203027244864
RHOT1P3: Information Gain = 0.2221066168898902
MTRNR2L8: Information Gain = 0.2221033370875347
XIAPP1: Information Gain = 0.2220994433886092
BRI3BP: Information Gain = 0.22209759569355625
DPYSL5: Information Gain = 0.22207887957593342
CDCA3: Information Gain = 0.22207528667595744
EPAS1: Information Gain = 0.22206453649622881
LINC02506: Information Gain = 0.22205937368831985
MYADM: Information Gain = 0.22205572410206775
CRMP1: Information Gain = 0.2220547272046527
ARHGAP42-AS1: Information Gain = 0.22203959875940837
ACTG1P9: Information Gain = 0.22201835513703982
CFHR5: Information Gain = 0.22201753578815087
SUSD3: Information Gain = 0.2220004708859531
OR8B10P: Information Gain = 0.22197360234562846
NT5CP1: Information Gain = 0.22197279982638118
POU5F1B: Information Gain = 0.2219718697983868
PRNCR1: Information Gain = 0.22196461443962257
MIR4740: Information Gain = 0.22195885613450672
SRP9P1: Information Gain = 0.22195423451650864
DYSF: Information Gain = 0.22195280673714257
ATP5MKP1: Information Gain = 0.22195274014067423
TUBB2BP1: Information Gain = 0.22194299475738877
ADAM29: Information Gain = 0.22193775513735714
EHD4-AS1: Information Gain = 0.22193493665764663
ZFHX2: Information Gain = 0.221929966868053
AGXT: Information Gain = 0.2219279237192977
PLAC4: Information Gain = 0.22192654729184813
NPM1P46: Information Gain = 0.2219210440804631
CRISPLD1: Information Gain = 0.22192097741785988
HOXA5: Information Gain = 0.22191297943802368
TNFRSF14: Information Gain = 0.22191225364445732
MIR21: Information Gain = 0.22189958133390753
EID2B: Information Gain = 0.22189750083659798
ADTRP: Information Gain = 0.22189466676394587
CIT: Information Gain = 0.221888015271535
RAB42: Information Gain = 0.22187640839548672
PTPRB: Information Gain = 0.2218761348201832
SDSL: Information Gain = 0.22187572108348896
RN7SL535P: Information Gain = 0.22186900912025886
ZNF114-AS1: Information Gain = 0.22185524828120462
PTTG3P: Information Gain = 0.2218462778438559
MMP11: Information Gain = 0.2218454319690215
KRT8P1: Information Gain = 0.2218410422971755
ERVK-28: Information Gain = 0.2218379345036825
NEAT1: Information Gain = 0.22183672616170957
FDPSP4: Information Gain = 0.22183118827333925
RPS6KA6: Information Gain = 0.22180340205538207
RBM22P2: Information Gain = 0.22180000653853948
ITPRIP: Information Gain = 0.22179639544312213
LINC02680: Information Gain = 0.22178412405687586
C1orf216: Information Gain = 0.22176781611361163
FDPSP7: Information Gain = 0.2217650682973238
PTPRD: Information Gain = 0.22174514643348298
RN7SL659P: Information Gain = 0.22174244168394264
MIR3190: Information Gain = 0.22174146803063377
RNU6-163P: Information Gain = 0.22174146803063377
C21orf62: Information Gain = 0.22174146803063377
SEC14L1P1: Information Gain = 0.22173893158360447
ADRA1B: Information Gain = 0.2217388042802726
RTEL1: Information Gain = 0.2217311540095095
TTC23L-AS1: Information Gain = 0.22172544524211957
GLS2: Information Gain = 0.22172056591624623
CALN1: Information Gain = 0.2217064268882354
TGM3: Information Gain = 0.2217064268882354
CCN6: Information Gain = 0.2217064268882354
ZNF577: Information Gain = 0.22169893100519067
WDR77: Information Gain = 0.22167803996158875
RPL21P44: Information Gain = 0.22167737951717803
PTPRM: Information Gain = 0.22166789500422568
SOSTDC1: Information Gain = 0.22166303907562268
SYDE1: Information Gain = 0.2216630243693254
PRDX2P1: Information Gain = 0.22164266123455678
KANSL1L-AS1: Information Gain = 0.22162657258758967
BPIFA4P: Information Gain = 0.22160600229448413
FAM95C: Information Gain = 0.22158952186983782
SOBP: Information Gain = 0.22156479325018186
LINC00621: Information Gain = 0.22156082477410521
STAB2: Information Gain = 0.22155948451609264
BACE2: Information Gain = 0.22154550533981854
MIR3187: Information Gain = 0.22154398645310724
EMSLR: Information Gain = 0.221543710819881
LINC02318: Information Gain = 0.22153626600308352
DUTP6: Information Gain = 0.22153245103652708
UBE2R2-AS1: Information Gain = 0.2215268954705416
SLC7A1: Information Gain = 0.22152248258280638
FRG1-DT: Information Gain = 0.22151132947501995
ADGRD1: Information Gain = 0.2215103305245052
RNA5SP343: Information Gain = 0.22150938535911435
MAG: Information Gain = 0.2215092907476175
ZNF25: Information Gain = 0.22150622274268894
MIR5196: Information Gain = 0.22150475031801875
MIR6834: Information Gain = 0.22149541485736557
PNMT: Information Gain = 0.22149219638104567
RPL23AP52: Information Gain = 0.22149219638104567
RPL35AP2: Information Gain = 0.2214915146925096
SNORA25: Information Gain = 0.22147502369396932
TRAF6P1: Information Gain = 0.22146748041487418
HIGD1AP14: Information Gain = 0.22144351836598042
ARMH1: Information Gain = 0.22143135239739165
DLGAP4: Information Gain = 0.22142963747256084
LINC01508: Information Gain = 0.2214178640476454
SCUBE1: Information Gain = 0.22136643139998813
LRMDA: Information Gain = 0.22132988135410403
CDC20P1: Information Gain = 0.22131904078253273
FBXL2: Information Gain = 0.2213071951155159
OR7E29P: Information Gain = 0.22130468009800985
RNU6-780P: Information Gain = 0.2212888209376005
FCF1P1: Information Gain = 0.22128377816169675
GLRB: Information Gain = 0.22127206508413688
ALG8: Information Gain = 0.22126967484723759
IL6: Information Gain = 0.22126196737124815
CAVIN3: Information Gain = 0.2212541125591012
MLPH: Information Gain = 0.22123445679029285
LINC02178: Information Gain = 0.2212335746005727
POTEF: Information Gain = 0.22121867900919523
LINC00572: Information Gain = 0.22121468875885575
ATOH8: Information Gain = 0.2212091995210832
NLGN1: Information Gain = 0.2212084699106105
HORMAD2-AS1: Information Gain = 0.2211920907002185
EMILIN2: Information Gain = 0.22118849984627098
NLRP2B: Information Gain = 0.22118630062420053
SHBG: Information Gain = 0.2211752804762448
FUT5: Information Gain = 0.22116578170214374
GJA1P1: Information Gain = 0.22114589607053792
PIEZO2: Information Gain = 0.22114315126162198
SPINK2: Information Gain = 0.2211416118705951
SLC12A8: Information Gain = 0.2211326369395803
CAPN9: Information Gain = 0.22111935408520988
MYCL: Information Gain = 0.22111700506025356
DDX3Y: Information Gain = 0.22110576296182094
SAMSN1: Information Gain = 0.22110119650155835
CFTR: Information Gain = 0.22110023308958748
GPR161: Information Gain = 0.22108707966903784
KRT17P6: Information Gain = 0.22108692631323734
TOMM20L-DT: Information Gain = 0.2210811478035788
KCNG2: Information Gain = 0.2210805332688699
TEX44: Information Gain = 0.22107636309337697
CDK8P1: Information Gain = 0.22107234271358056
HCG4B: Information Gain = 0.22105021049080387
ATP6V1E1P1: Information Gain = 0.22103638389450753
ASB14: Information Gain = 0.22103427379390506
FRG1KP: Information Gain = 0.22102006508152794
ANKRD7: Information Gain = 0.22102006508152794
ATP5PBP2: Information Gain = 0.22102006508152794
ASS1P8: Information Gain = 0.22101291453036498
MIAT: Information Gain = 0.22101264165587953
MN1: Information Gain = 0.22100559739741188
BMPR1B: Information Gain = 0.2210055720640267
AOX1: Information Gain = 0.22100281673818523
CHP1P3: Information Gain = 0.2209906342420629
ZNF462: Information Gain = 0.22097049348198938
PTPRVP: Information Gain = 0.22095107412042325
DNAI4: Information Gain = 0.22094808537624733
ACAD8: Information Gain = 0.22094636443109228
SNORA60: Information Gain = 0.22093633169206006
ALG1L13P: Information Gain = 0.22092977727861718
CATSPERE: Information Gain = 0.2209231190594696
EIF4A2P1: Information Gain = 0.2209193910218481
GAPDHS: Information Gain = 0.2209180375382187
CMAHP: Information Gain = 0.22091183325151942
KLK10: Information Gain = 0.220901643018385
RN7SKP30: Information Gain = 0.2208964884672997
LINC00350: Information Gain = 0.22089045436043642
SLC35E1: Information Gain = 0.22086406972135686
IFITM3P2: Information Gain = 0.22086168648381221
ABCA10: Information Gain = 0.22086094080226104
LHX1: Information Gain = 0.22085075155134448
MIR1260B: Information Gain = 0.22083268535761058
CYP2C8: Information Gain = 0.22083057336973932
PGAM1P7: Information Gain = 0.22082042020697568
BRAFP1: Information Gain = 0.22081567857363527
ITGA9: Information Gain = 0.22080445019757478
CRB2: Information Gain = 0.22079984445573242
CHRNA7: Information Gain = 0.22077564846464814
RPS15AP12: Information Gain = 0.2207738812303457
NUP50P1: Information Gain = 0.22077320721268956
ARHGEF35: Information Gain = 0.2207696488413926
MAP3K7CL: Information Gain = 0.2207639886710342
KPNA4P1: Information Gain = 0.22074837107651102
HYKK: Information Gain = 0.22073991405574955
FCGR2B: Information Gain = 0.22072709000446888
TRIML2: Information Gain = 0.22072067974718723
TNRC6B-DT: Information Gain = 0.2207071499264479
UBR5-DT: Information Gain = 0.22069990681501062
TMEM130: Information Gain = 0.2206858012030959
SOX21-AS1: Information Gain = 0.22068535993338645
BMS1P22: Information Gain = 0.22068440249786692
TLR3: Information Gain = 0.22068030170048014
RPL13AP23: Information Gain = 0.22065861429473554
LINC02226: Information Gain = 0.22065219929623203
RAB28P5: Information Gain = 0.2206480762614711
BDKRB2: Information Gain = 0.2206200243428078
RN7SL130P: Information Gain = 0.220616371256803
FRG1FP: Information Gain = 0.22061606389890054
CHKA-DT: Information Gain = 0.22060544087028955
RNU4-22P: Information Gain = 0.22060431629365218
NDUFB2: Information Gain = 0.22059479019570372
NDUFAB1P1: Information Gain = 0.22058994673523635
TEX53: Information Gain = 0.2205869797622637
SLC25A48: Information Gain = 0.22058415141147614
ABCB4: Information Gain = 0.22058132848620504
KRTAP10-2: Information Gain = 0.2205809021548779
HRH1: Information Gain = 0.22057211171887237
RPL6P25: Information Gain = 0.22057211171887237
RBM22P4: Information Gain = 0.22057211171887237
EGFLAM-AS1: Information Gain = 0.2205595901231432
PPP1R2B: Information Gain = 0.2205445967130093
CYCSP24: Information Gain = 0.22053417497207484
GABPB1: Information Gain = 0.22053313538213093
RNU6-957P: Information Gain = 0.22052846689674666
RAD21P1: Information Gain = 0.22051123373060766
ROM1: Information Gain = 0.22050510586640182
IGHG4: Information Gain = 0.2204976062256645
PDCD6IPP2: Information Gain = 0.22049344843234775
SALL2: Information Gain = 0.2204840186775059
CPP: Information Gain = 0.2204819755360976
ELOVL3: Information Gain = 0.22046964169242877
ADAMTS6: Information Gain = 0.22046223454930525
FAM3B: Information Gain = 0.22045815815283043
COX20P2: Information Gain = 0.2204554386591615
MTND5P26: Information Gain = 0.2204535549943032
NASPP1: Information Gain = 0.2204467725528938
LINC00589: Information Gain = 0.22044222476579511
ZNF132-DT: Information Gain = 0.2204364698830712
EYS: Information Gain = 0.22043208990307117
RPS19P7: Information Gain = 0.22042729882751888
PTGES2: Information Gain = 0.22042607037568307
LINC02600: Information Gain = 0.22042196214014043
MRPS11: Information Gain = 0.2204174704210342
PRKCZ-AS1: Information Gain = 0.22040403075674875
PLEKHO2: Information Gain = 0.22039198610693744
MIR16-1: Information Gain = 0.2203862567861823
MTATP8P1: Information Gain = 0.2203780034564724
DNAAF4: Information Gain = 0.220374642998302
ABI1: Information Gain = 0.220369538283115
SEPHS2: Information Gain = 0.22036452529147854
UGP2: Information Gain = 0.22035628942337282
SUSD2: Information Gain = 0.22035016581989275
TSSK2: Information Gain = 0.22034510712808886
MIR6823: Information Gain = 0.22034401742428766
CARS1: Information Gain = 0.22034046168236232
CAMP: Information Gain = 0.220337157928147
SERPINA6: Information Gain = 0.22033329037264515
BDKRB1: Information Gain = 0.2203157653363803
LINC00845: Information Gain = 0.2203115790793131
TMEM178A: Information Gain = 0.22030427243535544
APBA2: Information Gain = 0.22029278268318198
IBSP: Information Gain = 0.2202872245377676
RN7SKP56: Information Gain = 0.22028015166013337
CTBP2P3: Information Gain = 0.22026564843499763
ISM1-AS1: Information Gain = 0.22026067890309342
RPL12P28: Information Gain = 0.22025942253068775
FGF7: Information Gain = 0.22025508521513726
ADGRG3: Information Gain = 0.22025244938264232
NEXMIF: Information Gain = 0.22024957939411172
RNU6-319P: Information Gain = 0.22024806756298632
SPATA4: Information Gain = 0.220242128492266
NBPF20: Information Gain = 0.22022339159489213
RPL36P4: Information Gain = 0.2202197994176296
GPC2: Information Gain = 0.22021706900763305
ABLIM1: Information Gain = 0.22021161568628034
JPH1: Information Gain = 0.2202091342265473
MIR3960: Information Gain = 0.220207882712675
OR5M3: Information Gain = 0.22020446094709478
ST8SIA6: Information Gain = 0.22019324599983814
LINC02641: Information Gain = 0.22017905456551334
ARF1P1: Information Gain = 0.22017481392960625
NPM1P24: Information Gain = 0.22017215016111957
MIR6838: Information Gain = 0.22016914050731984
IGHEP1: Information Gain = 0.22016809542242832
CTRB2: Information Gain = 0.22015365599705583
MYLK-AS1: Information Gain = 0.2201420681097086
VPS26BP1: Information Gain = 0.22012047868105133
MYOG: Information Gain = 0.2201010527262195
FBN1: Information Gain = 0.2200941445839022
SRSF3P5: Information Gain = 0.22008894509646537
RAP1AP: Information Gain = 0.22007902258378476
CROCCP4: Information Gain = 0.2200759559193557
SPDYE21: Information Gain = 0.22007428383904482
FOXN1: Information Gain = 0.22006668861071566
ATP5PBP7: Information Gain = 0.2200648921451973
TPI1P4: Information Gain = 0.22005749737741742
ZBTB39: Information Gain = 0.22004285547511748
FAM183A: Information Gain = 0.22003883410462777
ADH4: Information Gain = 0.2200359993920573
PLA2G1B: Information Gain = 0.2200273330095368
ELN: Information Gain = 0.2200273330095368
GNE: Information Gain = 0.22002514024312592
EEF1A1P29: Information Gain = 0.2200138235779603
RPL22P24: Information Gain = 0.22000314348523142
CD207: Information Gain = 0.2200030679623748
MIR146B: Information Gain = 0.22000117995145896
LINC02280: Information Gain = 0.22000117995145896
LINC02055: Information Gain = 0.21999467681015838
PLP1: Information Gain = 0.2199908126493908
MIR4482: Information Gain = 0.21998741720067394
MRPS5P3: Information Gain = 0.21998652355144244
LINC02888: Information Gain = 0.21998398158548227
TRAV29DV5: Information Gain = 0.2199681710041721
CATSPERZ: Information Gain = 0.21996065168358414
HMGA2-AS1: Information Gain = 0.21995895026710333
TINAGL1: Information Gain = 0.21995714187941018
MIR6506: Information Gain = 0.2199531323011754
LCE1B: Information Gain = 0.21994833968711025
BCAP31P2: Information Gain = 0.21994125864423641
COX5AP2: Information Gain = 0.2199360143267559
MIR1279: Information Gain = 0.219925615107186
CSRP3-AS1: Information Gain = 0.2199183697843652
LINC02012: Information Gain = 0.2199137423105031
MIR6779: Information Gain = 0.2199114473939241
TRBV20OR9-2: Information Gain = 0.21990712137473944
RPL8P2: Information Gain = 0.21990301112888577
OPN3: Information Gain = 0.21987213988064713
HCAR2: Information Gain = 0.21987050217394688
VSIG1: Information Gain = 0.21985388349446788
LDLRAD4-AS1: Information Gain = 0.21984962745208425
TDRP: Information Gain = 0.21984420783516034
LIPE: Information Gain = 0.21984070346878326
MIX23P3: Information Gain = 0.21983972941730645
TSPY26P: Information Gain = 0.21982566224851374
GLULP4: Information Gain = 0.21982523383152497
SCHIP1: Information Gain = 0.21980425115156055
MTMR9LP: Information Gain = 0.21979908399584702
CCNI2: Information Gain = 0.21979560745696203
CLPS: Information Gain = 0.219795038719238
DLGAP5: Information Gain = 0.2197871507926623
TOLLIP-DT: Information Gain = 0.2197850834165027
SMIM6: Information Gain = 0.2197794795971264
EDA: Information Gain = 0.21977647980597115
LINC01686: Information Gain = 0.2197756156692119
ADAMTS7: Information Gain = 0.21977089992082544
SMCO2: Information Gain = 0.21976613148251256
RN7SKP116: Information Gain = 0.21976091228049066
H1-12P: Information Gain = 0.21975871357632437
KLF7P1: Information Gain = 0.21971961394908956
FNTAP1: Information Gain = 0.21971812984950434
MIR3609: Information Gain = 0.21971664574991912
LINC02518: Information Gain = 0.21967551588205358
NAV2-AS3: Information Gain = 0.21966797782150427
RASA3-IT1: Information Gain = 0.2196619020270172
MTX3: Information Gain = 0.2196608614134199
OR8A3P: Information Gain = 0.21965207812895238
MPC1-DT: Information Gain = 0.2196469697121981
ZNF827: Information Gain = 0.21963611150938722
LINC00634: Information Gain = 0.2196246579544603
BMS1P15: Information Gain = 0.2196211683233089
YWHAZP2: Information Gain = 0.2196110332880763
HAL: Information Gain = 0.219608344198863
RPL3P8: Information Gain = 0.21960492942674015
PRTN3: Information Gain = 0.21960001666211482
PDE10A: Information Gain = 0.21956517063595848
TTLL1-AS1: Information Gain = 0.21955125922544516
UMODL1-AS1: Information Gain = 0.219547848673457
OR10D3: Information Gain = 0.219547848673457
RPS4XP8: Information Gain = 0.21954433334535994
ARHGAP29: Information Gain = 0.21953476817445772
SH2D5: Information Gain = 0.2195030724431366
COPS8P2: Information Gain = 0.21949981045591582
MIR6075: Information Gain = 0.21949431293537747
RPS26P41: Information Gain = 0.2194900187063995
KCNG4: Information Gain = 0.21948654995293104
CEP126: Information Gain = 0.21947755807834657
MGAT4EP: Information Gain = 0.21947503916078115
SLC2A3P4: Information Gain = 0.21946603320333713
MKI67: Information Gain = 0.21946528281250544
TMPRSS7: Information Gain = 0.21946082119991628
RNA5SP283: Information Gain = 0.219456087679613
KCNJ6: Information Gain = 0.21943333766393436
PROKR1: Information Gain = 0.21941528682787026
YPEL5P2: Information Gain = 0.21939220869490983
MSN: Information Gain = 0.2193906542874151
RN7SL431P: Information Gain = 0.2193906542874151
SPEF2: Information Gain = 0.2193877364698562
TGIF1P1: Information Gain = 0.21938660483443506
AKAP12: Information Gain = 0.21937891539318266
GRM6: Information Gain = 0.2193774936421804
SLC6A16: Information Gain = 0.21937462466881663
CHRNE: Information Gain = 0.21936572378661245
RPL18AP15: Information Gain = 0.2193612785432859
GATA6-AS1: Information Gain = 0.21936016573304618
BACH1-IT1: Information Gain = 0.21935426750261056
LINC01441: Information Gain = 0.21933735041446734
CAMK2D: Information Gain = 0.21933703767933332
LINC01134: Information Gain = 0.21933365314320574
SLC5A5: Information Gain = 0.21932781758763942
MAFTRR: Information Gain = 0.2193270998495862
HMGN1P35: Information Gain = 0.2193107983135718
GPR37L1: Information Gain = 0.2193028839303255
MIR6844: Information Gain = 0.2193011873417241
NELL1: Information Gain = 0.21929780585595915
GJA1: Information Gain = 0.21929780585595915
MRAP-AS1: Information Gain = 0.21929780585595915
MESP2: Information Gain = 0.21929693393484873
ALMS1-IT1: Information Gain = 0.21929503911302017
GRXCR2: Information Gain = 0.2192911844885781
SPIRE1: Information Gain = 0.21928945916033982
GSTP1: Information Gain = 0.21928941894672427
CYP4Z1: Information Gain = 0.21927098652157762
KRT8P43: Information Gain = 0.21926272321480256
SLC52A3: Information Gain = 0.21925499846142937
CBX5P1: Information Gain = 0.2192509704853678
MIR4690: Information Gain = 0.21925042676521245
TSSK3: Information Gain = 0.219241184261594
TXNP4: Information Gain = 0.21924111163652849
FOXD2-AS1: Information Gain = 0.21923856847924794
DAPK1: Information Gain = 0.21922888693836962
C16orf92: Information Gain = 0.21921962898128666
PLCD4: Information Gain = 0.2192169088074578
TCEAL8: Information Gain = 0.21921434629667402
PPIL1: Information Gain = 0.21921052125174256
MANBA: Information Gain = 0.21920962617657525
LINC01747: Information Gain = 0.21917262074557264
DNM3: Information Gain = 0.21916517635286792
PRICKLE2-AS3: Information Gain = 0.2191527986306836
CCDC110: Information Gain = 0.21913983032938478
HOMER2: Information Gain = 0.21911711239292053
NPIPA9: Information Gain = 0.21911442121728752
MIR6790: Information Gain = 0.21911213885067182
TMSB15B-AS1: Information Gain = 0.21910763914173637
IFI6: Information Gain = 0.21910588704502199
ZNF419: Information Gain = 0.21910083392756619
SYT11: Information Gain = 0.21908664219542828
LINC02851: Information Gain = 0.21908629971090865
SNTG1: Information Gain = 0.21908617954090182
HCLS1: Information Gain = 0.21907176607295709
UBASH3A: Information Gain = 0.2190716119662348
OR8G5: Information Gain = 0.21907154179839305
HLA-DQB2: Information Gain = 0.21907035137657793
KCTD5P1: Information Gain = 0.2190605595716808
GSDMD: Information Gain = 0.21905790344421439
NRN1L: Information Gain = 0.21904979442287442
GAB3: Information Gain = 0.21904968381574164
EIF3IP1: Information Gain = 0.21904923283932698
RNF222: Information Gain = 0.21904708523340322
SLC22A13: Information Gain = 0.21904708523340322
CLRN1-AS1: Information Gain = 0.21904708523340322
GNG10P1: Information Gain = 0.21904708523340322
HSP90AA4P: Information Gain = 0.21904708523340322
CDHR4: Information Gain = 0.21904708523340322
EXTL3-AS1: Information Gain = 0.2190447046177244
PSMC1P8: Information Gain = 0.21904391735650885
MIR5188: Information Gain = 0.21903092130714952
P2RY1: Information Gain = 0.21902744320363343
EIF3LP1: Information Gain = 0.21902242009734452
TMTC2: Information Gain = 0.2190084516323363
KLF3P1: Information Gain = 0.21900313305337482
F7: Information Gain = 0.21899325808305692
SV2B: Information Gain = 0.21899256233175324
OR8T1P: Information Gain = 0.21898137927787364
RNF20: Information Gain = 0.21896515982186848
ANKRD11P2: Information Gain = 0.2189622085299292
DDX59-AS1: Information Gain = 0.21895832495642553
OPN1SW: Information Gain = 0.2189557586157176
LINC01366: Information Gain = 0.21894912509176523
NLRP3P1: Information Gain = 0.21894145879094218
LINC00534: Information Gain = 0.21893950787544347
SEPTIN7P8: Information Gain = 0.21893651981429318
PHBP7: Information Gain = 0.21893285800001427
RNU6-883P: Information Gain = 0.21892721893201106
GAPDHP67: Information Gain = 0.21892590279418256
RRN3P2: Information Gain = 0.2189184160830855
CHI3L1: Information Gain = 0.21889713409452782
OXCT1: Information Gain = 0.21889641293245288
MFAP4: Information Gain = 0.21889040132177984
BET1: Information Gain = 0.2188866253984434
RPS2P2: Information Gain = 0.2188808340078625
HYI-AS1: Information Gain = 0.21887865080526514
IDH1-AS1: Information Gain = 0.21887509085839008
PINCR: Information Gain = 0.2188742479929644
PAQR8: Information Gain = 0.21887259534247505
ZNF460-AS1: Information Gain = 0.21883663258049846
MIRLET7F1: Information Gain = 0.2188290299632898
PSMC1P11: Information Gain = 0.2188280046207478
H2BC18: Information Gain = 0.2188096058930673
ALDH1A3-AS1: Information Gain = 0.21880736814590906
GAPDHP48: Information Gain = 0.21880694049583993
ZNF649: Information Gain = 0.21880694049583993
PHF2P2: Information Gain = 0.21880300075640458
PPARGC1A: Information Gain = 0.21879140751251058
ANP32BP1: Information Gain = 0.218766766933584
ADAMTS2: Information Gain = 0.2187602544481544
RNU6-418P: Information Gain = 0.2187601439033635
MAP3K2-DT: Information Gain = 0.2187580077336324
AATBC: Information Gain = 0.21874842324766464
RNA5SP439: Information Gain = 0.21874393797237723
HMGN2P38: Information Gain = 0.21873991603138698
FAM3D: Information Gain = 0.21873622345102217
RTCA-AS1: Information Gain = 0.21873485306064455
HIC2: Information Gain = 0.21872301921703152
UGT1A12P: Information Gain = 0.21872075456688234
FHAD1: Information Gain = 0.21871522997142812
PCOLCE2: Information Gain = 0.2187127878725148
LINC00858: Information Gain = 0.21870465510381876
HS3ST6: Information Gain = 0.21869951451816072
MAPK8IP2: Information Gain = 0.21869565168378702
TAPT1-AS1: Information Gain = 0.21869534641190347
SLC1A6: Information Gain = 0.2186946869847768
LINC00664: Information Gain = 0.21869217855047607
RPL21P41: Information Gain = 0.218687419406965
INPP5J: Information Gain = 0.21868212550470467
SCARNA3: Information Gain = 0.21868028431280462
MTND4LP30: Information Gain = 0.21868028431280462
HLA-DRB9: Information Gain = 0.21867873166085072
STX7: Information Gain = 0.2186646990998251
PRB3: Information Gain = 0.21865564876809906
VDAC1P7: Information Gain = 0.2186459065574473
TONSL-AS1: Information Gain = 0.21864207477623432
TLR6: Information Gain = 0.21863522233859056
SF3A3P1: Information Gain = 0.21863403584962438
SHOX2: Information Gain = 0.21862163415733993
MIR637: Information Gain = 0.21860478225315694
LINC01397: Information Gain = 0.21859768846280359
OR8B2: Information Gain = 0.21859760424973795
RN7SL743P: Information Gain = 0.2185961699027421
MIR193B: Information Gain = 0.2185924370335015
HAUS6P1: Information Gain = 0.2185844986537926
PTGS1: Information Gain = 0.2185815519207872
ZNF320: Information Gain = 0.21857343432859766
LINC00266-1: Information Gain = 0.21857055620549604
MRPS31P2: Information Gain = 0.2185674709357257
SF3A3P2: Information Gain = 0.21856459722484112
LEFTY1: Information Gain = 0.21855053976059158
SYNPR-AS1: Information Gain = 0.21854939025259967
RN7SL164P: Information Gain = 0.21854751957406937
ALOX12B: Information Gain = 0.2185437183844534
MIR421: Information Gain = 0.21854070194658992
MT-TV: Information Gain = 0.21853550613251738
HERC2P3: Information Gain = 0.21853280843229528
CNN2P12: Information Gain = 0.21853193143502136
DNAI3: Information Gain = 0.21853050472551327
IMPDH1P2: Information Gain = 0.21852787431071108
MIR4523: Information Gain = 0.21851635166987804
MIR4675: Information Gain = 0.21851474689915507
SNORD34: Information Gain = 0.21849411075184388
RPS23P1: Information Gain = 0.21848786705704026
HENMT1: Information Gain = 0.21848461833662958
GNRH1: Information Gain = 0.21846534545465257
C5AR2: Information Gain = 0.21846079127856144
ARX: Information Gain = 0.21845725973983932
LUADT1: Information Gain = 0.2184523070646942
RPS5P2: Information Gain = 0.21845073406347582
SLCO1A2: Information Gain = 0.21843826909010544
GDAP1L1: Information Gain = 0.21842502246462492
NADK2-AS1: Information Gain = 0.21842004715008567
SLC6A19: Information Gain = 0.2184132595687691
HBQ1: Information Gain = 0.21841267403160147
LRP1: Information Gain = 0.21841096369940338
HMGN2P10: Information Gain = 0.21840909365395555
PLAC1: Information Gain = 0.2184012608821373
ANKRD49P1: Information Gain = 0.2183974143767382
RPL36AP45: Information Gain = 0.21839248643525333
MIR6872: Information Gain = 0.21838972705956383
MAGEE1: Information Gain = 0.21838178270892317
CCDC200: Information Gain = 0.2183748526474505
CBX3P1: Information Gain = 0.21837426838595264
CALCB: Information Gain = 0.21836616291898636
LINP1: Information Gain = 0.21836282617181668
RPL32P16: Information Gain = 0.21835615511947637
PRL: Information Gain = 0.21835252990795495
PBX1-AS1: Information Gain = 0.21833044125730905
MTHFD2P7: Information Gain = 0.2183147582299656
FENDRR: Information Gain = 0.21831319798843984
FOXD3-AS1: Information Gain = 0.21830468054501773
RPL22P1: Information Gain = 0.21830459222605247
MIR193BHG: Information Gain = 0.21829333309872712
FNDC3CP: Information Gain = 0.21828682668042276
RNF213-AS1: Information Gain = 0.21827409992522973
ARHGEF18-AS1: Information Gain = 0.21827287662526262
ZNF221: Information Gain = 0.21827231506189393
EVX1: Information Gain = 0.21827164320004955
ROBO3: Information Gain = 0.21827131225526886
SNORA50A: Information Gain = 0.2182622739246205
RBMS1P1: Information Gain = 0.21826216030403556
GOLGA8H: Information Gain = 0.21826067768691892
MIR6836: Information Gain = 0.2182530447279316
LINC02895: Information Gain = 0.2182396768678867
GPR55: Information Gain = 0.21823519105439537
KRTAP1-3: Information Gain = 0.21822802749604064
TNNC2: Information Gain = 0.21822748024524752
APOB: Information Gain = 0.2182262672816926
PCNPP3: Information Gain = 0.21820850046769547
AFTPH-DT: Information Gain = 0.2182074089441126
ATP5F1EP2: Information Gain = 0.21820463867779827
EEF1A1P2: Information Gain = 0.21820187443796302
F8A3: Information Gain = 0.21819016413241865
HCG27: Information Gain = 0.2181786222225255
LINC02816: Information Gain = 0.218175315897281
VN1R83P: Information Gain = 0.2181508311070366
BHLHE41: Information Gain = 0.21814848844719914
APLF: Information Gain = 0.21814490759195926
SERPINA4: Information Gain = 0.21814382095197415
MMP21: Information Gain = 0.21814366815118835
MACROD2-IT1: Information Gain = 0.21814014814475824
TMEM132E: Information Gain = 0.21813339361559736
LBX1-AS1: Information Gain = 0.21813339361559736
BNC2-AS1: Information Gain = 0.21813339361559736
OXGR1: Information Gain = 0.21813339361559736
HTR5A: Information Gain = 0.21813339361559736
RNU6-460P: Information Gain = 0.21813339361559736
GTF2IRD2P1: Information Gain = 0.21813179945935568
CHST9: Information Gain = 0.21812952807000974
ZBBX: Information Gain = 0.2181250912745092
LINC02019: Information Gain = 0.21812410613795574
NPR3: Information Gain = 0.2181183761661456
LINC01311: Information Gain = 0.21811385143609296
PRSS29P: Information Gain = 0.21811179723687735
KRT8P4: Information Gain = 0.21809924248582013
DSC1: Information Gain = 0.21809819276053077
KAT7P1: Information Gain = 0.21809468035438706
RNVU1-2A: Information Gain = 0.2180803787682788
ANO7L1: Information Gain = 0.21806728999701308
RPS26P15: Information Gain = 0.21806094332365067
PRKN: Information Gain = 0.2180590403201117
INSC: Information Gain = 0.21805576379113223
HPCAL4: Information Gain = 0.218055763791132
CAHM: Information Gain = 0.21804945710717716
SLC12A4: Information Gain = 0.21804865565576725
COX6CP2: Information Gain = 0.21804092279527842
ZDHHC1: Information Gain = 0.21803640231081922
MBLAC1: Information Gain = 0.21802641993426253
CORO1A: Information Gain = 0.2180192096953517
MYL12BP2: Information Gain = 0.21801598616762097
CASS4: Information Gain = 0.21800941593882062
MTND4LP7: Information Gain = 0.21800450527468374
RN7SL89P: Information Gain = 0.2180030211750985
LINC00997: Information Gain = 0.21799841305767198
ZNF517: Information Gain = 0.21799416716280628
LRIG2: Information Gain = 0.21799385064446364
EPB41L4A-AS1: Information Gain = 0.21799380521396983
GUCY1B1: Information Gain = 0.21799098644710257
ACTR1AP1: Information Gain = 0.21798651838105076
PRRT4: Information Gain = 0.21797841583857447
LINC02443: Information Gain = 0.2179675260457885
ACTBP12: Information Gain = 0.2179627584178998
ANAPC1P2: Information Gain = 0.21795422077732418
PDE4DIPP7: Information Gain = 0.21794730435139353
NACA2: Information Gain = 0.2179458533101759
PRIM1: Information Gain = 0.21794423342980562
H2AZP1: Information Gain = 0.21793935033906453
ARHGAP26: Information Gain = 0.21793921789133708
TMEM145: Information Gain = 0.2179351426738667
KCNQ4: Information Gain = 0.21793071037078215
CCDC181: Information Gain = 0.21792565053926416
RPSAP6: Information Gain = 0.21791511681504216
RNA5SP437: Information Gain = 0.21791474534799105
MIR2110: Information Gain = 0.21791131283489218
RNFT1P3: Information Gain = 0.21790896797820736
SLC4A1: Information Gain = 0.21790860960916758
SNORD36B: Information Gain = 0.21790657770287458
MTND5P1: Information Gain = 0.21790217441974624
ADAM11: Information Gain = 0.217900822548714
EDIL3-DT: Information Gain = 0.21789339024035925
ANKRD18B: Information Gain = 0.217890608886822
TMPRSS11A: Information Gain = 0.21788520049182125
SMAD5: Information Gain = 0.21788326958240667
ZCCHC18: Information Gain = 0.21787973113883385
MBTPS1-DT: Information Gain = 0.2178758990730918
NRSN2-AS1: Information Gain = 0.21786473058161726
ZSCAN5C: Information Gain = 0.2178629868492914
DEFB1: Information Gain = 0.2178507108258405
DIAPH2-AS1: Information Gain = 0.21785050254334704
HOXB6: Information Gain = 0.21782942864708477
MIR4284: Information Gain = 0.21781454052775717
CFAP69: Information Gain = 0.21780666739021148
HNRNPA1P46: Information Gain = 0.21780654922229736
CCDC152: Information Gain = 0.21780554887691084
IL21R: Information Gain = 0.21780212465508142
IL21R-AS1: Information Gain = 0.21780212465508142
ANKRD20A19P: Information Gain = 0.21778157708434964
GRIA3: Information Gain = 0.21777715591485958
CCNJP2: Information Gain = 0.21777570694189174
CORO2B: Information Gain = 0.2177738286431925
MIR181B2: Information Gain = 0.21776179791172323
NOS2: Information Gain = 0.21776179791172323
THSD8: Information Gain = 0.21775805486931077
PTCH2: Information Gain = 0.21775187602888324
NIFKP4: Information Gain = 0.2177327470888053
NCMAP: Information Gain = 0.2177326643992088
ACTBP7: Information Gain = 0.2177245394008005
NME5: Information Gain = 0.21772034268951068
RNU6-1285P: Information Gain = 0.21771459540926097
TTC4P1: Information Gain = 0.2177111520455579
PMS2P11: Information Gain = 0.21770687355614093
FAM43B: Information Gain = 0.21770504074346153
GVINP1: Information Gain = 0.21769524175339505
MEF2C-AS1: Information Gain = 0.21769524175339505
MEGF10: Information Gain = 0.21769524175339505
FAM166C: Information Gain = 0.21769524175339505
PTCHD3P2: Information Gain = 0.21769524175339505
TRABD2B: Information Gain = 0.21769524175339505
KCNMB2: Information Gain = 0.21769524175339505
IGF1: Information Gain = 0.21769524175339505
RPL7P58: Information Gain = 0.21769524175339505
ROCR: Information Gain = 0.21768028659603478
VGLL1: Information Gain = 0.21767143468844852
ACTP1: Information Gain = 0.2176623004854099
BMP8A: Information Gain = 0.21765203522140641
ASTN2: Information Gain = 0.2176440629675691
LRFN2: Information Gain = 0.21763947325552468
CNTNAP3C: Information Gain = 0.21763602644129487
BCAS2P1: Information Gain = 0.2176331678422634
CICP13: Information Gain = 0.21762148741165266
LINC02463: Information Gain = 0.21762148741165266
ZNF658: Information Gain = 0.21761202126032564
TXNDC8: Information Gain = 0.21761104491430472
ABHD14A-ACY1: Information Gain = 0.21760553997468168
CDH17: Information Gain = 0.21760089054872056
DYNAP: Information Gain = 0.21759608975694467
LONRF3: Information Gain = 0.2175937375929331
LINC01091: Information Gain = 0.2175742816286872
PNPLA1: Information Gain = 0.21756303672029054
GCATP1: Information Gain = 0.2175603578098475
GNMT: Information Gain = 0.21755900046041998
SEC61G: Information Gain = 0.21755153594827803
SBK2: Information Gain = 0.21755134294672707
AOC2: Information Gain = 0.21755004037919967
TMEM169: Information Gain = 0.21753378052029637
ELAVL2: Information Gain = 0.21752450062181028
RTKN: Information Gain = 0.21749057805583671
CHID1: Information Gain = 0.2174877451201851
SLC4A1APP1: Information Gain = 0.21748081457130408
PICART1: Information Gain = 0.21746697910151802
PDC-AS1: Information Gain = 0.2174657670073521
CLDN14: Information Gain = 0.21746009395252686
SNORA63D: Information Gain = 0.21745070780632925
FBLN2: Information Gain = 0.21744665946254238
RPL23AP12: Information Gain = 0.2174426627006194
PDCL3P2: Information Gain = 0.21744024432997122
PTTG2: Information Gain = 0.21742561336448984
ADORA3: Information Gain = 0.21741106741858762
ARHGAP31: Information Gain = 0.21740721738668678
RNY3P15: Information Gain = 0.21740329119165258
DYNLT3P2: Information Gain = 0.2174017967647468
LIG1: Information Gain = 0.21739685362061656
ZFPM2-AS1: Information Gain = 0.2173953082072706
SELENOP: Information Gain = 0.2173840750436038
FBLN7: Information Gain = 0.21738316176911754
P2RX5: Information Gain = 0.2173460991662406
SPRY4: Information Gain = 0.2173388090199635
MIR6859-1: Information Gain = 0.21733845863488988
CSTA: Information Gain = 0.2173381692434344
JMY: Information Gain = 0.217317268231064
HCAR3: Information Gain = 0.2173170657126846
CGB3: Information Gain = 0.21731628475157816
KRT18P6: Information Gain = 0.21731542471894572
USP51: Information Gain = 0.21730187145994884
WASIR1: Information Gain = 0.2172956478638528
ACER2P1: Information Gain = 0.21728569062909053
MIR365A: Information Gain = 0.21728569062909053
CSMD2: Information Gain = 0.21728569062909053
ENPP7P7: Information Gain = 0.2172805720349278
RNU4-78P: Information Gain = 0.21727394192187055
CHST1: Information Gain = 0.21727211130840485
LINC00648: Information Gain = 0.2172658803271983
LINC01361: Information Gain = 0.21724816890456844
IQCN: Information Gain = 0.21724564707048488
MIR7851: Information Gain = 0.2172411817181541
C1QTNF1: Information Gain = 0.2172312127464111
SPATA45: Information Gain = 0.21722902238769626
PLCL2: Information Gain = 0.21721768031961752
FAM114A1: Information Gain = 0.2172102006624259
GATA1: Information Gain = 0.21720977322722357
CTBP2P8: Information Gain = 0.21719631765919267
ATP13A4: Information Gain = 0.2171879146570006
RPS17P5: Information Gain = 0.21718636244704737
PPP1R2: Information Gain = 0.21717598394837223
FYB1: Information Gain = 0.21717443696043826
RBMXP3: Information Gain = 0.2171690274613578
RNU6-481P: Information Gain = 0.217161188217176
C16orf96: Information Gain = 0.2171549757040716
CALM2P3: Information Gain = 0.21715096915861287
NEXN: Information Gain = 0.21714475496873087
ZXDA: Information Gain = 0.21714397383147133
TPRKBP2: Information Gain = 0.21714032767874847
DHX58: Information Gain = 0.21713986225106785
IL1A: Information Gain = 0.21713825445401058
C20orf144: Information Gain = 0.21713279815942932
C19orf71: Information Gain = 0.2171307984984321
MIR1234: Information Gain = 0.217128658196305
SLC38A3: Information Gain = 0.21712443341670062
LINC02904: Information Gain = 0.21712273001158078
PPIAP31: Information Gain = 0.217117795170638
RPL21P135: Information Gain = 0.2171126157766352
SASH1: Information Gain = 0.21710783857861604
U2AF1L5: Information Gain = 0.21710530777738501
NPAS2-AS1: Information Gain = 0.2170981403422616
RSPO1: Information Gain = 0.217095958518972
POU3F2: Information Gain = 0.2170921351338655
C8orf74: Information Gain = 0.21708881583051332
FRMPD1: Information Gain = 0.21708451944124074
LINC00942: Information Gain = 0.21708207429312543
KRT18P40: Information Gain = 0.21708192684648941
MIR600: Information Gain = 0.21707962143840742
DSEL: Information Gain = 0.21707107941994574
RMDN2-AS1: Information Gain = 0.2170698167966847
RNU6-455P: Information Gain = 0.21706655433673294
AGGF1P1: Information Gain = 0.21706107083108717
GAPDHP24: Information Gain = 0.21705570298510857
MT1L: Information Gain = 0.21705462773979955
LINC01907: Information Gain = 0.21705268279858636
CD4: Information Gain = 0.21704589468787883
PZP: Information Gain = 0.21704406451993918
SMPD4P1: Information Gain = 0.21703905476722762
EPCAM-DT: Information Gain = 0.21703556477388197
UBE2Q2L: Information Gain = 0.21699700471936012
NCF2: Information Gain = 0.21699500767689583
PAX7: Information Gain = 0.2169941699330693
IPO8P1: Information Gain = 0.21699229555721433
CCDC160: Information Gain = 0.21698744203768716
AKR1B1: Information Gain = 0.21698601667435335
KCNH6: Information Gain = 0.21696746214903317
RPS4XP19: Information Gain = 0.21696746214903317
RPL22P16: Information Gain = 0.21695703721499404
LINC02615: Information Gain = 0.21694994364489606
BOD1L1: Information Gain = 0.21694966425446638
DUTP7: Information Gain = 0.21694606014705475
RPS29P7: Information Gain = 0.21694109084101632
INSL6: Information Gain = 0.2169365667946841
AQP7: Information Gain = 0.2169332076885626
MIR3189: Information Gain = 0.21692797875573588
EVPLL: Information Gain = 0.21690969446397346
SLC19A3: Information Gain = 0.2168981123142968
RPS3AP29: Information Gain = 0.2168981123142968
LEF1: Information Gain = 0.21688886612548375
RPS17P1: Information Gain = 0.21688885922376477
TRAV27: Information Gain = 0.21688804437974185
MSLN: Information Gain = 0.21688036267199107
TRIM34: Information Gain = 0.2168734835488917
ICMT: Information Gain = 0.21685828852021616
HAS2: Information Gain = 0.21685563747627357
SNORD38A: Information Gain = 0.21684927635973095
TNKS: Information Gain = 0.21684218030101254
LINC02694: Information Gain = 0.21684217317607923
STX8P1: Information Gain = 0.21684064617325638
ST6GALNAC4: Information Gain = 0.21683235842904636
NME2P2: Information Gain = 0.21682738598047568
ARPP21: Information Gain = 0.21682738598047568
GRASLND: Information Gain = 0.21682738598047568
PAX2: Information Gain = 0.21682738598047568
RFTN1: Information Gain = 0.2168270497136553
VSTM2A: Information Gain = 0.21681812429843084
CTRB1: Information Gain = 0.21681325385211103
SCARNA1: Information Gain = 0.21679721191560986
PIH1D2: Information Gain = 0.21679568134850968
FAM13C: Information Gain = 0.21679250738999922
PLPPR3: Information Gain = 0.21678690600990658
PRDX3P2: Information Gain = 0.216780577697262
TMEM190: Information Gain = 0.21678047623642893
HMCN2: Information Gain = 0.21677562902456704
RNU6-1280P: Information Gain = 0.21677189747978431
KRTDAP: Information Gain = 0.2167663767282899
SNORA79B: Information Gain = 0.2167596882070819
PSMD7P1: Information Gain = 0.21675379694679697
PRKY: Information Gain = 0.21673719839028815
APOOP2: Information Gain = 0.21673169688305816
CCL26: Information Gain = 0.21671340302890196
YBX1P10: Information Gain = 0.2167050304783511
PTAFR: Information Gain = 0.2167008636923069
ZNF441: Information Gain = 0.21668938050400133
FAM87B: Information Gain = 0.216687403252668
TUBAP4: Information Gain = 0.21668501083537262
S100A3: Information Gain = 0.21668501083537262
GNG8: Information Gain = 0.21668445577425444
TAS2R13: Information Gain = 0.2166831108449494
SERPINA9: Information Gain = 0.2166781070095687
PPIAP85: Information Gain = 0.2166730939763537
ZBTB46: Information Gain = 0.21667174943989997
RPL31P63: Information Gain = 0.21667152383871824
LYPLA2P1: Information Gain = 0.21666571247365995
BLZF2P: Information Gain = 0.21666433455762224
EXOC3L2: Information Gain = 0.21666379254156598
SLC2A7: Information Gain = 0.21666253075649333
GASAL1: Information Gain = 0.21665781982847032
CENPF: Information Gain = 0.21665733164452217
NKX2-1: Information Gain = 0.21665453945103375
C9orf57: Information Gain = 0.21665277923668547
OR6K4P: Information Gain = 0.21665277923668547
PDGFRB: Information Gain = 0.21665277923668547
CTSLP2: Information Gain = 0.21665277923668547
FOXQ1: Information Gain = 0.21664506877130885
SERHL2: Information Gain = 0.2166383325714929
CATSPER1: Information Gain = 0.21663496701672336
KLF2P1: Information Gain = 0.21662778394108684
PHF3: Information Gain = 0.21660993884036572
TG: Information Gain = 0.21660962618720325
CCL4L2: Information Gain = 0.21660785340334843
CNTNAP3B: Information Gain = 0.21660785340334843
LINC00955: Information Gain = 0.21660357562735522
MIR1825: Information Gain = 0.21659394116965847
GAPDHP23: Information Gain = 0.2165865394437576
RPL10AP2: Information Gain = 0.21658364773153682
RBMX2P3: Information Gain = 0.21658148209533157
C1QTNF3: Information Gain = 0.21658039117602046
PNPO: Information Gain = 0.21657631170059966
NFYCP2: Information Gain = 0.21657588818150986
PPIAP40: Information Gain = 0.21657260309999593
MUC4: Information Gain = 0.21656663509097762
XKR7: Information Gain = 0.21656554954672913
KCNQ2: Information Gain = 0.21655936442579393
KIAA1210: Information Gain = 0.21655762575110993
RPL32P6: Information Gain = 0.21654826415155815
TMEM266: Information Gain = 0.2165439208826052
GALNT15: Information Gain = 0.21653850356861226
RPS15AP6: Information Gain = 0.21653737208098267
ZNF532: Information Gain = 0.21653480378179535
MIR4720: Information Gain = 0.21653237775056144
RPL21P93: Information Gain = 0.21652092269660095
SHISAL2A: Information Gain = 0.21652092269660095
KRT18P56: Information Gain = 0.21650922001917516
SPSB3: Information Gain = 0.21650793417807734
JAM2: Information Gain = 0.21650691501296926
SUMO2P1: Information Gain = 0.21650509258900796
FOXP1-AS1: Information Gain = 0.21648863675460728
INCA1: Information Gain = 0.21647406555754345
C20orf27: Information Gain = 0.21647024350939215
NAT8B: Information Gain = 0.21647017887045927
SARM1: Information Gain = 0.21645512167753278
ST3GAL1-DT: Information Gain = 0.21644936514781143
SEC14L5: Information Gain = 0.2164462997387222
MAGEC3: Information Gain = 0.21644012774805876
SHLD2P3: Information Gain = 0.21643905857136914
HMGN1P8: Information Gain = 0.2164264581144908
COL4A2: Information Gain = 0.2164242961901519
LINC00460: Information Gain = 0.21642129775520402
MIR3139: Information Gain = 0.21642129775520402
MYO1G: Information Gain = 0.2164212977552038
LINC02595: Information Gain = 0.21641577677632884
C1QL1: Information Gain = 0.21640352042790023
MIR155: Information Gain = 0.21640007946139717
MYBPC1: Information Gain = 0.21640007946139717
CDCP1: Information Gain = 0.21640007946139717
SFTPA1: Information Gain = 0.21639608380866737
ABHD12B: Information Gain = 0.21638984749788848
MYO7A: Information Gain = 0.21638886278064207
RPL13AP2: Information Gain = 0.21638683071813514
POLG-DT: Information Gain = 0.21638420393448476
KLK4: Information Gain = 0.21638320856719906
SPINK5: Information Gain = 0.21637761972647085
SLC9A9: Information Gain = 0.2163656275947996
DIS3L-AS1: Information Gain = 0.2163635392588319
C5orf46: Information Gain = 0.2163626948484998
RPL19P20: Information Gain = 0.2163626948484998
CNTN2: Information Gain = 0.21636269484849957
TSPOAP1: Information Gain = 0.21636269484849957
LINC01338: Information Gain = 0.21636269484849957
TRPM2: Information Gain = 0.21636238389005924
LINC00167: Information Gain = 0.21635729871562437
FBXL19: Information Gain = 0.21635459270943413
LINC00840: Information Gain = 0.2163521316286392
NBEAP1: Information Gain = 0.2163521316286392
KCNT1: Information Gain = 0.2163441403231796
GUCA1A: Information Gain = 0.2163441403231794
GPHA2: Information Gain = 0.216339823408791
SRMP2: Information Gain = 0.21633279535720473
NMD3P1: Information Gain = 0.21633065904017634
KIAA1217: Information Gain = 0.21632328879369345
CYP2T3P: Information Gain = 0.21632314954167065
AJAP1: Information Gain = 0.21631991173882437
APOBEC3B: Information Gain = 0.2163186220532538
SPAG16: Information Gain = 0.2163172603817849
BEAN1: Information Gain = 0.21630292201080858
OR7E22P: Information Gain = 0.21630029791383065
CYP3A7: Information Gain = 0.21629598377078874
CYP3A7-CYP3A51P: Information Gain = 0.21629598377078874
ZDHHC22: Information Gain = 0.21627412528759216
LINC02335: Information Gain = 0.21627241703674605
SLN: Information Gain = 0.21626422726858308
ITGA6: Information Gain = 0.2162604292188861
ENTPD8: Information Gain = 0.21625782872028543
FOXA3: Information Gain = 0.2162562359631235
OR52K3P: Information Gain = 0.2162562359631235
KRTAP9-12P: Information Gain = 0.2162562359631235
RPL36P2: Information Gain = 0.2162562359631235
RPS3AP26: Information Gain = 0.21624966639930632
TPBGL: Information Gain = 0.21623938491755035
SIRT4: Information Gain = 0.21623894208811412
LRRC4C: Information Gain = 0.21623640355147522
LINC01238: Information Gain = 0.21622738458704105
C22orf23: Information Gain = 0.21621872700990896
TPI1P2: Information Gain = 0.21621681299290518
LINC01186: Information Gain = 0.21621113925116742
RN7SL354P: Information Gain = 0.21620876978516534
CARNMT1-AS1: Information Gain = 0.21619759089851898
NMRK2: Information Gain = 0.216196679587068
RCC2P6: Information Gain = 0.21618804496408162
ZNF571-AS1: Information Gain = 0.21618788866126737
SEPHS1P6: Information Gain = 0.2161831626877997
AP1M2P1: Information Gain = 0.2161782987807559
CDC42-IT1: Information Gain = 0.21617318652173378
UFM1P2: Information Gain = 0.21617147432500072
SCN3B: Information Gain = 0.21616847077334733
PKNOX2: Information Gain = 0.21616833160306714
APOBEC3G: Information Gain = 0.21616833160306692
IRAK2: Information Gain = 0.21616470344363048
GALNT16: Information Gain = 0.21616270960972495
AGO4: Information Gain = 0.21615696643194093
POTEG: Information Gain = 0.21615365912147588
LINC00626: Information Gain = 0.2161351709659125
WFDC3: Information Gain = 0.21613172963483063
MYOM1: Information Gain = 0.216131183620351
CBX3P2: Information Gain = 0.21612198490105317
ZWINT: Information Gain = 0.21612163705223741
EEF1A1P1: Information Gain = 0.21611404024657488
OR10AC1: Information Gain = 0.21611353407991518
LIPM: Information Gain = 0.2161125669876125
RPL37P2: Information Gain = 0.21611112194050852
YPEL4: Information Gain = 0.21609674632994946
TCAF2C: Information Gain = 0.21609196036487743
PIGHP1: Information Gain = 0.21608638962336468
TBCAP1: Information Gain = 0.2160789881092926
MT-TG: Information Gain = 0.21606115393393188
C1GALT1C1L: Information Gain = 0.21605698749857183
BEX1: Information Gain = 0.21605639134333754
C1QL4: Information Gain = 0.21605120497170605
DUSP5-DT: Information Gain = 0.21604799944094966
KRT15: Information Gain = 0.21602991152289963
CMPK2: Information Gain = 0.21602780886734152
ADRA2B: Information Gain = 0.21602372410086468
CXCL8: Information Gain = 0.21601906871373422
COP1P1: Information Gain = 0.21601763841439903
SMYD3-AS1: Information Gain = 0.21601763841439903
ODF3: Information Gain = 0.21601763841439903
VSTM4: Information Gain = 0.21601763841439903
BTF3L4P1: Information Gain = 0.21601763841439903
ARMC3: Information Gain = 0.21601559310011642
SEMA7A: Information Gain = 0.21601409339919542
MIR1972-1: Information Gain = 0.21601147735669413
RNU2-27P: Information Gain = 0.21600509723673267
PRKCQ: Information Gain = 0.2160037107105981
RPL32P27: Information Gain = 0.2160003712937637
RNA5SP141: Information Gain = 0.2159912610978818
HLA-DMB: Information Gain = 0.2159778504013632
MIR3621: Information Gain = 0.21596837320129003
ITPRIP-AS1: Information Gain = 0.21596728183526515
P3H4: Information Gain = 0.21595315207068877
NCR3: Information Gain = 0.21595142589755456
LINC01228: Information Gain = 0.21594857070292628
LINC00494: Information Gain = 0.21594432224950189
ESYT3: Information Gain = 0.21593762453227483
EEF1A1P11: Information Gain = 0.21593508375865134
PTGIS: Information Gain = 0.2159311643536781
RSL24D1P1: Information Gain = 0.2159311643536781
CHMP5P1: Information Gain = 0.21592633137020578
EGR2: Information Gain = 0.21592460546702408
PTPRC: Information Gain = 0.2159245967865473
LINC01114: Information Gain = 0.21592219939490964
HOXD8: Information Gain = 0.2159221233248223
RNY1P15: Information Gain = 0.21591479646029144
KIAA0408: Information Gain = 0.2159135262903924
TFGP1: Information Gain = 0.21589767904656165
PPP4R1-AS1: Information Gain = 0.2158809958932102
ACTG1P3: Information Gain = 0.2158755353465167
LINC01933: Information Gain = 0.21587522521546054
CCL3: Information Gain = 0.21587522521546054
TUBBP2: Information Gain = 0.2158713915159871
FRMD5: Information Gain = 0.21587020935345214
SGCD: Information Gain = 0.21586907805741462
ARPP19P1: Information Gain = 0.21586066634287016
MIR6740: Information Gain = 0.21585928073887484
PEG10: Information Gain = 0.21585729742333282
HMGB1P3: Information Gain = 0.2158533189174361
RPSAP69: Information Gain = 0.2158390261892571
RSL24D1P6: Information Gain = 0.21583400690308951
SUMO2P6: Information Gain = 0.21582896398399676
MIR5006: Information Gain = 0.21582758688149295
TNIP1: Information Gain = 0.21581936250946887
SNHG28: Information Gain = 0.21580728678982042
RNA5SP37: Information Gain = 0.21580519018581956
RBM11: Information Gain = 0.21580354762818077
PRKAG2-AS1: Information Gain = 0.2158016460513983
RN7SL775P: Information Gain = 0.21579747422632778
IL11RA: Information Gain = 0.21579655757742167
LINC01305: Information Gain = 0.2157928794167434
ATP6V0E1P3: Information Gain = 0.2157873208554042
RN7SL4P: Information Gain = 0.21578726975650664
CRBN: Information Gain = 0.21578010092751088
MON1A: Information Gain = 0.2157798877230519
CCR2: Information Gain = 0.2157663436785724
SLC6A20: Information Gain = 0.2157663436785724
LINC02533: Information Gain = 0.21574800250298032
LINC01362: Information Gain = 0.21574736432810604
COL7A1: Information Gain = 0.2157449070627555
SNORD3B-1: Information Gain = 0.2157440983658201
DEPDC1P1: Information Gain = 0.21573862746648498
RASAL2-AS1: Information Gain = 0.21573243860861413
SNORD54: Information Gain = 0.21572836029175013
ACSM4: Information Gain = 0.21572196011239186
OR7E90P: Information Gain = 0.2157180145219788
H3P47: Information Gain = 0.21571264768917553
SETP22: Information Gain = 0.21571140345353745
VEGFD: Information Gain = 0.2157022198968599
GPBAR1: Information Gain = 0.21568759149041505
RN7SL466P: Information Gain = 0.2156842868152249
ABCB10: Information Gain = 0.21567260378733755
SCML2P1: Information Gain = 0.21566834548496816
ATP6V0E1P2: Information Gain = 0.21566745127350928
C1orf94: Information Gain = 0.21566745127350928
GCM2: Information Gain = 0.21566745127350928
SDR9C7: Information Gain = 0.21566745127350928
MAS1: Information Gain = 0.21566745127350928
FNDC7: Information Gain = 0.21566745127350906
NACAD: Information Gain = 0.21566745127350906
IFFO1: Information Gain = 0.21566560422389336
SPANXB1: Information Gain = 0.21566444905804527
PTMAP1: Information Gain = 0.21566247595897003
LINC02300: Information Gain = 0.21565966888421562
SRCIN1: Information Gain = 0.21565181144400714
OGFRP1: Information Gain = 0.215641550049525
TMEM121B: Information Gain = 0.21563387494107178
CATSPER3: Information Gain = 0.2156334061293752
LINC01978: Information Gain = 0.2156294927027238
RPS8P4: Information Gain = 0.21562250018029872
EVI2B: Information Gain = 0.21562250018029872
HES7: Information Gain = 0.21562081301757385
ZFP37: Information Gain = 0.21562081301757385
ALDH3B1: Information Gain = 0.21561996399684324
MIR544B: Information Gain = 0.21561977544617106
RPL7P9: Information Gain = 0.21561413500295434
KLHL38: Information Gain = 0.2156076784358183
RNU1-134P: Information Gain = 0.2156005785984232
RN7SL443P: Information Gain = 0.2156003550563632
G0S2: Information Gain = 0.2155996871303587
SLC7A9: Information Gain = 0.21558922571811556
PCSK1: Information Gain = 0.2155871293646363
DIRAS3: Information Gain = 0.21557954872635166
MIR23A: Information Gain = 0.21557954691345316
FAM157A: Information Gain = 0.21557943321630502
UPK3A: Information Gain = 0.21557600613118066
SLC9A7P1: Information Gain = 0.2155662489514356
RHEX: Information Gain = 0.21556099238813298
FLNC: Information Gain = 0.21556099238813298
SNORA20: Information Gain = 0.21556099238813298
KRT8P27: Information Gain = 0.21556099238813298
UQCRBP2: Information Gain = 0.21554251445118844
DNAJC28: Information Gain = 0.21553990944437684
WWP1P1: Information Gain = 0.215531894883789
SNORD52: Information Gain = 0.21553048377285755
CLLU1: Information Gain = 0.21552360777523538
MIR4513: Information Gain = 0.21552360777523538
DDX12P: Information Gain = 0.21552268050626577
HSPA2-AS1: Information Gain = 0.21552169521590914
CCND2-AS1: Information Gain = 0.21550151424321173
CCND2: Information Gain = 0.21550151424321173
RPL26P30: Information Gain = 0.21549808425176176
TNFAIP8: Information Gain = 0.21549647799360794
RGMA: Information Gain = 0.21549516085285414
ARHGAP44-AS1: Information Gain = 0.21548907063899603
MIR548O: Information Gain = 0.21548365124793722
MIR933: Information Gain = 0.21548365124793722
MIR6165: Information Gain = 0.21548365124793722
ENPP2: Information Gain = 0.21548365124793722
RNU7-40P: Information Gain = 0.21548365124793722
LINC02679: Information Gain = 0.21548365124793722
BRWD1-AS1: Information Gain = 0.21547491757639037
MIR34A: Information Gain = 0.21547456077307325
NOTO: Information Gain = 0.2154705606748415
SNORD70B: Information Gain = 0.21546900484678755
SEPTIN7P7: Information Gain = 0.21545738834214778
MYBL2: Information Gain = 0.21545168080328247
LRIG2-DT: Information Gain = 0.21544585358060364
RPP25: Information Gain = 0.2154426044828679
MIR30B: Information Gain = 0.21544112280623806
ZNF826P: Information Gain = 0.21543304728771262
RDM1P1: Information Gain = 0.21542401064573413
MIR6810: Information Gain = 0.21542311894423238
POLH-AS1: Information Gain = 0.2154061844830839
FZD1: Information Gain = 0.2153945549010603
RPL12P47: Information Gain = 0.21538934958283362
RPS7P14: Information Gain = 0.21538843847194844
RNU6-29P: Information Gain = 0.21538315376967576
C1GALT1: Information Gain = 0.21537339361129915
BZW1P2: Information Gain = 0.21537087391017362
RPL13AP7: Information Gain = 0.21537008554428994
PRAM1: Information Gain = 0.2153690548236149
EIF2S2P4: Information Gain = 0.2153689855072245
RBPMS2: Information Gain = 0.21536643305493475
SOX10: Information Gain = 0.21536639708175387
LINC00640: Information Gain = 0.21536516653982063
FAM133FP: Information Gain = 0.21536385292431137
FAM217A: Information Gain = 0.2153609708686297
LINC01068: Information Gain = 0.2153597860133123
LINC01864: Information Gain = 0.2153584948842573
MTATP8P2: Information Gain = 0.2153584948842573
ITGB1: Information Gain = 0.21534984870069795
HLA-DRB1: Information Gain = 0.21534648966546888
HSPA8P16: Information Gain = 0.21534522714072213
KLHDC7B-DT: Information Gain = 0.21534522714072213
ST18: Information Gain = 0.21534522714072213
LINC02223: Information Gain = 0.21534522714072213
COX6B1P4: Information Gain = 0.21534522714072213
HNRNPA1P47: Information Gain = 0.21534254161732047
NT5M: Information Gain = 0.21533013803377
OR7E37P: Information Gain = 0.21532706534333412
MIS18A-AS1: Information Gain = 0.21532622824636283
LINC02269: Information Gain = 0.2153224664564486
SLC4A9: Information Gain = 0.21531919577989522
ADCY5: Information Gain = 0.2153171104449878
MYCNUT: Information Gain = 0.21530784252782476
IL17REL: Information Gain = 0.21529819893515745
IGHV4-34: Information Gain = 0.21529761578398343
MAD2L1-DT: Information Gain = 0.21529743416387181
H3P11: Information Gain = 0.21529727930796416
RPL31P7: Information Gain = 0.21529727930796416
NLRP3: Information Gain = 0.21529727930796416
IGSF22: Information Gain = 0.21529436624562015
HMGA1P7: Information Gain = 0.21529353832321063
KRT85: Information Gain = 0.21529302854510335
KCNC2: Information Gain = 0.21529202187016172
SLC25A27: Information Gain = 0.2152909380685033
LST1: Information Gain = 0.21528928800250458
CICP9: Information Gain = 0.21528928800250458
TNFAIP6: Information Gain = 0.21528928800250458
FGG: Information Gain = 0.21528928800250458
LYG2: Information Gain = 0.215287932522368
FABP6-AS1: Information Gain = 0.21527444099815063
NOG: Information Gain = 0.2152642343375113
RP9: Information Gain = 0.21526144821718018
CLDN11: Information Gain = 0.21524806969013377
ANGPTL2: Information Gain = 0.21524806969013377
CSF3R: Information Gain = 0.21524806969013377
LINC01749: Information Gain = 0.21524806969013377
PRKAR2B-AS1: Information Gain = 0.21524806969013377
LINC00608: Information Gain = 0.21524806969013377
VAX1: Information Gain = 0.21524806969013377
RPL23AP35: Information Gain = 0.21524806969013377
CALCA: Information Gain = 0.21524806969013377
DBIL5P2: Information Gain = 0.21524806969013377
LYPLA1P3: Information Gain = 0.2152352712878851
MEAF6P1: Information Gain = 0.21523259620508028
ZMYND10: Information Gain = 0.21522644829480164
SLC8A3: Information Gain = 0.21522116281507486
DLG5-AS1: Information Gain = 0.21522008324514252
PDE1A: Information Gain = 0.21520172516904257
TRIM67: Information Gain = 0.21520138364244823
MEDAG: Information Gain = 0.21520138364244823
ITPRID1: Information Gain = 0.21520138364244823
YY2: Information Gain = 0.21519960163216556
RN7SL166P: Information Gain = 0.21518246758704596
UBE2S: Information Gain = 0.21518216374144994
TBPL2: Information Gain = 0.2151738344824239
CENPK: Information Gain = 0.21516974929627541
TMCO2: Information Gain = 0.21516542276787964
MMP10: Information Gain = 0.21516542276787964
KCTD9P1: Information Gain = 0.21516542276787964
WDHD1: Information Gain = 0.21515158381064592
SNORA73B: Information Gain = 0.21514902917257417
MEFV: Information Gain = 0.21514830469593993
PSMD8P1: Information Gain = 0.2151424640596249
YIPF7: Information Gain = 0.21514145916414318
MINAR2: Information Gain = 0.21513821768442343
ABCC6P2: Information Gain = 0.2151342197652386
ISOC2: Information Gain = 0.2151238489927907
TXNP5: Information Gain = 0.21511432164226507
PLAT: Information Gain = 0.2151134792823921
JAG1: Information Gain = 0.21510742884611989
LINC01185: Information Gain = 0.21510487966640413
TTYH2: Information Gain = 0.21509876120520421
CGB7: Information Gain = 0.21509323712432638
LINC02068: Information Gain = 0.21509081549534104
LINC01701: Information Gain = 0.21507708632841593
CALHM3: Information Gain = 0.21506968787240432
RPL37A-DT: Information Gain = 0.21506849639170844
ME3: Information Gain = 0.2150566081380605
CNTNAP3P1: Information Gain = 0.21505448279227068
ITGA6-AS1: Information Gain = 0.21505419393390768
PIGM: Information Gain = 0.21505395033764718
RPL7AP11: Information Gain = 0.21504315318763223
SERHL: Information Gain = 0.21503936649479694
LINC02052: Information Gain = 0.21503356622779557
NIFKP8: Information Gain = 0.21503356622779557
ACTN3: Information Gain = 0.21503356622779557
C20orf202: Information Gain = 0.21503356622779557
MAPK4: Information Gain = 0.21503356622779557
UROC1: Information Gain = 0.21503356622779557
OLFML2A: Information Gain = 0.21502965780147365
RN7SL253P: Information Gain = 0.21502811148425383
NFYBP1: Information Gain = 0.21502418479333896
HHIP-AS1: Information Gain = 0.21501123753541784
DKKL1: Information Gain = 0.21500896678703274
LINC00865: Information Gain = 0.21500001897198406
CCDC69: Information Gain = 0.214989132488141
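The ranking above assigns each gene a per-gene information gain with respect to the oxygen-condition labels. As a reference point only, here is a minimal, self-contained sketch of how such a score can be computed; binarizing each gene at its median and the toy `expr`/`labels` arrays are assumptions for illustration, not the notebook's actual estimator:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain_for_gene(expression, labels):
    """Information gain of the labels given a median-binarized gene."""
    split = expression > np.median(expression)  # assumption: median split
    h = entropy(labels)
    cond = 0.0
    for mask in (split, ~split):
        if mask.any():
            cond += mask.mean() * entropy(labels[mask])
    return h - cond

# Toy example: a gene whose expression perfectly separates the two conditions
labels = np.array(["hypoxia"] * 4 + ["normoxia"] * 4)
expr = np.array([5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0])
print(round(information_gain_for_gene(expr, labels), 3))  # 1.0 for a perfect split
```

A gene uninformative about the condition would score near 0; the values in the ranking above (around 0.215-0.217) sit between these extremes.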
In [90]:
# Keep only the genes whose information gain exceeds 0.215
filtered_genes1 = list(filter(lambda gene: information_gain[gene] > 0.215, sorted_genes))
r = len(filtered_genes1)
print(r)
3414
In [91]:
def plot_info(len_data, data):
    """Bar chart of the information gain for each selected gene."""
    plt.figure(figsize=(10, 6))
    plt.bar(range(len_data), information_gain[data])
    plt.xlabel('Genes')
    plt.ylabel('Information Gain')
    plt.title('Information Gain for Selected Genes')
    plt.tight_layout()
    plt.grid(False)
    plt.show()
In [92]:
plot_info(r, filtered_genes1)

[Bar chart: "Information Gain for Selected Genes" — information gain per selected gene]
This is the list of the 3,414 selected genes:

In [93]:
# Map the selected column positions back to gene names
for x in filtered_genes1:
    print(data.columns[x])
NDRG1
BNIP3
HK2
P4HA1
GAPDHP1
BNIP3L
MT-CYB
MT-CO3
FAM162A
LDHAP4
ENO2
HILPDA
ERO1A
PDK1
PGK1
VEGFA
C4orf3
LDHA
KDM3A
DSP
PFKP
PFKFB3
DDIT4
PFKFB4
GAPDHP65
CYP1B1
GPI
MTATP6P1
CYP1B1-AS1
AK4
IRF2BP2
BNIP3P1
MT-ATP8
MXI1
MT-ATP6
TLE1
FUT11
RIMKLA
UBC
IFITM2
CIART
TES
HK2P1
HIF1A-AS3
GBE1
MYO1B
GAPDH
P4HA2
SLC2A1
PGK1P1
ITGA5
NFE2L2
ALDOA
RSBN1
MT-TK
EIF1
FDPS
STC2
DYNC2I2
MT-CO2
PGAM1
TMEM45A
ENO1
ALDOAP2
PTPRN
MIR210HG
RUSC1-AS1
FOSL2
C8orf58
PYCR3
ELOVL2
RAP2B
HLA-B
BHLHE40
RIOK3
BHLHE40-AS1
KRT80
SOX4
P4HA2-AS1
CYP1A1
USP3
SNRNP25
TNFRSF21
TANC2
PSME2
GAREM1
IER5L
AK1
WDR45B
EGLN3
PGK1P2
EGLN1
GAPDHP72
PGP
CEBPG
SPOCK1
IFITM3
DAPK3
GNA13
HLA-C
ACTG1
NAMPT
DSCAM-AS1
CLK3
SLC9A3R1
PNRC1
IGFBP3
SPRY1
MIR6892
NEBL
BBC3
PGM1
ADM
QSOX1
DARS1
MKNK2
SLC27A4
EML3
EMP2
SDF2L1
ST3GAL1
TGIF1
GAPDHP70
MRPL4
DAAM1
LY6E
IDI1
TST
SLC9A3R1-AS1
IFITM1
HNRNPA2B1
CCNG2
TRAPPC4
VLDLR-AS1
GAPDHP60
LSM4
NCK2
ARPC1B
GABARAP
LDHAP7
TSC22D2
PRELID2
MSANTD3
RAD9A
POLR1D
MIR3615
CA9
PSME2P2
MKRN1
CTPS1
NTN4
NDUFS8
LDHAP2
NDUFB8
ZNF292
SRM
BTG1
OSER1
ELF3
CTNNA1
RNF183
DHRS3
MIR7703
KCMF1
FTL
C2orf72
DDIT3
STK38L
SMAD2
EGILA
SMAD9
IL27RA
FAM110C
RBPJ
ESYT2
TUBD1
ZNF160
PKM
TGFBI
TMSB10
MACC1
PAM
IGDCC3
ZYX
HMOX1
HELLS
SFXN2
FNIP1
GAPDHP61
TPD52
CRELD2
TXNRD1
RORA
WASF2
RAMP1
RND3
ZNF395
FYN
GAPDHP63
UHRF1
TUBG1
EIF4A2
KLF3
RHOD
DAPP1
AVL9
SLC3A2
TFG
TCAF2P1
RCAN3
PPP1CA
MIR5047
LRR1
YEATS2
MYL12A
BEST1
CLDND1
NUPR1
ARFGEF3
FTH1
HMBS
DUSP10
ALOX5AP
VLDLR
SINHCAF
RPL17P50
RNF19B
ZFAS1
FASN
PGM2L1
RRAGD
MYRIP
GGCT
KLF3-AS1
DCXR
TLE1P1
CDC42EP1
RPL34
PCAT6
EBP
DUSP4
CHD2
ANGPTL4
RUNX1
INSIG2
PHLDA3
GAPDHP40
RANBP1
POLR2L
RNASE4
DNPH1
HPDL
POP5
ATP5F1D
THAP8
WEE1
CCNI
SLC29A1
TRIB3
KLF7
FOXO3
PSME2P1
GNAS-AS1
FAM220A
ZNF12
NUDT5
MFSD3
ANG
DOK7
PRMT6
FBXL6
ELOVL6
VDAC1
STRA6
ASNSP1
HNRNPAB
CAPN2
SLITRK6
GRB10
FEN1
FBXO42
SLC25A36
CDC42EP3
GET1
PCBP1-AS1
FOXO1
HEY1
FAM13A
BCL10
FBXO16
PDZK1
PTGER4
TFRC
KDM5B
GINS2
VPS37D
ADCY9
LRATD2
NDUFC2
NECAB1
TKFC
TRIM16
CDC45
LINC02649
TMEM265
EDN2
DENND11
SRF
GPS1
FAM13A-AS1
PDLIM5
KLHL2P1
ATP5MC1
ZBTB21
CFD
EMX1
PLBD1
PTPRH
ATP5F1E
APEH
TCAF2
MAP1B
TMEM64
NECTIN2
NDUFS6
TMEM123
CERS4
LDHAP3
CD55
EIF4EBP1
PAGR1
ADAMTS19-AS1
SEC31A
FADS1
GPNMB
MSANTD3-TMEFF1
CHMP4C
TMEM65
IMMP2L
RLF
GAD1
SDAD1P1
ANKRD12
SNX27
RPL21
ASF1B
C1QBP
DHCR7
FADS2
ACLY
CENATAC-DT
FTH1P16
H2AX
VEGFC
LOXL2
MYO1E
CCDC28B
TUFT1
GAPDHP21
MOV10
BCL2
FLRT3
CBLB
TRABD2A
MYO10
MPV17L2
NDUFB1
WSB1
TEDC2
SDR16C5
OLFM1
KLF6
KPNA2
CEACAM5
PHTF1
ZNF84
SYT12
DHRS11
FDFT1
MYCBP
AZIN1
MYH9
ACOT7
DBI
TTC9
PPP1R10
MMP16
SLC25A10
SH3GL3
PSAP
DMRTA1
ATXN1-AS1
UNC5B-AS1
LIMCH1
FANCG
AGPS
BCAS1
DGKD
ARL8A
KCNK5
PCAT1
MEIKIN
TPT1-AS1
CDK2AP1
ATXN1
GPR179
IFFO2
KLF11
ACAT2
PCP4L1
GPR146
MB
BEND5
BCL2L12
COPS9
DOLK
PCBP1
ELOVL5
SHISA5
PLOD2
CSNK1A1
RNF149
ATAD3A
ATF4
RPL31
PALLD
PLOD1
C1orf116
ADGRF4
HLA-W
GYS1
TMOD3
KCNG1
TPX2
PTEN
TAF9B
BOD1
EDA2R
CHRNA5
HSD17B10
MALL
HAUS8
GADD45A
B4GAT1
ARF6
ZFAND1
RAB6A
USP3-AS1
ELL2
RET
ATF2
WDR45BP1
SIKE1
KRTAP5-2
PLIN5
GAS5
LRIG3
NRP1
GFRA1
CHAC2
ATXN3
TMEM104
ANKZF1
ULBP1
MICB
IFI35
HLA-E
PIK3R3
NFIL3
PHF19
CLVS1
ATP1B1
CDC25A
IDI2-AS1
NDUFC2-KCTD14
KLHL24
FBXO32
TMEM229B
TSPAN4
FCGRT
RAP1GAP
FAM167A
ENDOG
TMEM59
MVK
GAPDHP71
POLR3K
S100A13
FBXO38
LDLRAD1
MT-CO1
LAMC2
PPFIA4
ANXA1
GDF15
IL3RA
GPAT3
SPC24
UBE2QL1
MIR6728
MALAT1
PLAAT2
ACTG1P10
MYL12-AS1
GOLM1
MIR1199
EIF4B
CYB561A3
PPM1K-DT
MRPL28
CDCA7
CCDC74A
SLC25A39
C4orf47
ABHD15
ADM2
PYGL
FRY
FUOM
FTLP3
GPER1
ZNF689
GALNT18
RPS27
MIR181A1HG
POLA2
SCEL
FAM47E-STBD1
INSYN1-AS1
SAT1
FOXP1
SLC25A35
HLA-T
C6orf141
SERGEF
TRIM29
HAUS1
SPRR1A
APOBEC3A
SNTB1
RNF19A
YEATS2-AS1
ATIC
TMEM54
CENPM
P3R3URF-PIK3R3
GPR155
RYR2
SERINC3
CD9
CCN4
MAOB
RPL7
TNFRSF19
LDHAP5
LRP4
LPP
LNPK
NDUFA4L2
CAST
CISD3
CCSAP
NAPRT
METTL7A
CPEB2
WDR4
FTH1P20
TBC1D8B
SCARB1
FAM210A
PLD1
CDK5R2
MTHFD1
XPOT
PPP1R3C
MCM3
RPL23AP7
PPP1R14C
TPD52L1
UNC5B
FUT3
JPH2
SAMD4A
IGFLR1
MUC16
HLA-L
MRNIP
ZNF365
RCN1P2
RAPGEFL1
ADAT1
HINT3
SLC7A11
RIBC2
SAMHD1
GAL
CXADR
HSD17B1-AS1
SMAP1
ELOVL2-AS1
LOX
SHMT1
KRT83
NUP62CL
SPATS2L
RECQL4
TKT
PWWP3B
INSYN1
A4GALT
STING1
KRTAP5-AS1
SRPX
TBC1D3L
AGMAT
FRK
LATS1
KRT224P
GRM4
HOXA10
PDGFB
EIF2B3
PACSIN2
PPM1J
ST8SIA6-AS1
RNPEP
CBX5
PNMA2
ANXA2R
PAK6
GAPDHP73
EGFR
FAM111B
CDKN2AIPNL
SOGA3
MCM10
CD109
CDC20
AHR
HOXA13
KMT5B
GAPDHP64
C15orf65
FAM214B
SLC25A15
S100P
GAPDHP69
RIPPLY3
RAB3IL1
ALDOAP1
MCRIP2P1
SLC26A5
SQSTM1
TCP11L2
NDUFB10
POMGNT1
WDR76
CHTF8
OTULINL
LRATD1
WDR61
TTC36
DPF1
CFDP1
ETNK2
MIR7844
PARP1
ADGRF1
IRF6
LINC00623
MTCO3P12
GAPDHP35
MFSD13A
ARMC6
GET1-SH3BGR
CD320
MTHFD2
VAPA
MIF
ZNF367
ZNF148
SEMA4B
NECTIN3-AS1
PCCA-DT
KCND3
CAVIN1
ATP5F1A
PCLAF
DAPK2
SLC1A1
DCAF10
E2F2
GAS5-AS1
PPP1R14B-AS1
XPOTP1
H3C4
MRPL38
GOLGA6L10
NRGN
DTL
HSD17B1
RGCC
AIFM1
SNHG22
MRPL41
NT5DC2
CYP4F22
BEST4
NKAIN1
POLD1
TUBA3E
KLF13
LINC01214
GIHCG
STXBP5-AS1
CDKN3
TARS1
APOL4
H4C5
ZNF337
DHCR24
PPP2R5B
PARK7
CLPSL2
RTN4RL1
RNF144A
FAM86C1P
AKR1C1
H2AC7
EDN1
CBX4
MIF-AS1
MAP4K2
COA8
IFI30
BRCA1
GON7
RBBP7
SORL1
BSCL2
KRT4
FGF2
CDK5
DMC1
TUBA4A
FKBP5
CCDC107
H2AC9P
TMEM74B
NPC1L1
NDUFA4
DRAXIN
TMEM19
BMF
PLEKHG1
RNF180
HYMAI
IFI44
ARID5A
PLK1
CEACAM6
DNASE1L2
EEF1A1
TPSP2
STBD1
ZNF528-AS1
CYRIA
ENO1P1
ITGB3BP
HDHD5-AS1
TNFRSF18
SPATA18
TLCD1
SNTA1
MED15
ZNF682
AZIN2
HEATR6
ENOX1
RNU1-82P
ADRA2A
CCDC33
AMPD3
TNFRSF6B
HIGD1AP1
PLEKHO1
TLE6
ACTBP15
MITF
PKDCC
ARFRP1
FTH1P12
MIR210
MEF2A
REEP2
OTX1
VXN
SLK
PARM1
TSPAN12
NIBAN1
TOX2
CFAP418-AS1
MYBL1
MIR34AHG
SINHCAFP1
GLUD1P3
FTH1P15
ANAPC5
G6PC3
CASTOR3
BTG1-DT
TPM4
CYFIP2
DPAGT1
GATA2
ASNS
SEL1L
RUSC1
RN7SL674P
RCN3
CALM3
ABHD8
LPIN3
ZMPSTE24-DT
DNAAF10
SNW1
S100A4
LSS
DSC2
EGFR-AS1
DUSP2
MLKL
C21orf58
CRYBG3
POLE2
STX3
LERFS
EXOG
TOP2A
PLBD1-AS1
NAV1
ATP6V1G1
TK1
CFAP251
TPTE2
CAVIN2
KRT19
CLEC3A
RELN
EGR3
HMGN3
HES2
DUSP8
KIF5B
MCM6
HOXA10-AS
EFEMP2
CALR4P
DNER
BMF-AS1
GAPDHP68
SERPINE2
FBP1
BMS1P10
KRT18P46
MMP13
GAPDHP32
ADAMTS9-AS2
KBTBD2
SERTAD2
RGS20
C2CD2
MIR7113
PPP1R3E
ARID3A
ERICH6-AS1
STAG3
RAMP2
LRP4-AS1
GPR139
SYNE3
CPA6
GLRA3
ERLNC1
EEF1A1P13
WSCD1
PTTG1IP
SDK1-AS1
FLOT2
MFSD11
TOX3
PLXNA2
TNNT1
PHLDB2
LIN7A
IDS
ANXA3
SCGB2A1
DHX40
GLIDR
IL17RB
KRT16
ANK2
CHAF1B
ZMAT4
CYB5B
SRD5A3-AS1
SLC47A1
SPA17
LRP2
ACTG1P12
SMIM15
NAXE
ZNF524
THEG
RANGRF
FNDC10
ISOC1
TRIM16L
GPRC5A
MID1
ERRFI1
CCDC71
MLEC
TONSL
CCR3
COL9A2
C1QTNF6
COL17A1
TM7SF2
SYNGR3
KHDC1
RGS17
C1R
ACSS1
TENM3-AS1
SERINC1
LINC01659
FOXRED1
MUC12-AS1
FTH1P7
HERC3
TATDN1P1
KRT17
NUAK1
PGLYRP2
MCUB
MYORG
ACTR3C
TMCC3
NPY1R
LRRC45
BLNK
NAMPTP1
MIR3917
CSTF3
FOXP2
FOXI3
GAPDHP44
YPEL5
RN7SL1
PRKAA2
SPATA12
PTPRR
COQ4
DPCD
CCND3
ARHGEF28
MKRN4P
TMEM45B
ATP6AP1L
MIR6819
FTH1P8
SBK1
SUOX
MEAF6
MAGEF1
ATP5MG
RBP7
MAB21L3
GALR2
WASF4P
ARL6IP1P2
SARS1
MIR6811
ZNF766
DOCK11
CHST14
NUDT6
ECI1
SOWAHC
TOMM40P2
SEPHS1P4
RPS12P26
HSPB1P2
LONRF2
THEMIS2
CNPY4
DTYMK
ABCB8
TMEM132B
HS6ST3
SOD2-OT1
ID2-AS1
ETV6
CCDC74B
DPT
CSGALNACT1
KCNN1
ZNF70
TIGD3
RHPN1-AS1
MALRD1
KRT89P
DACT3-AS1
PPP1R3B
CHAC1
ATG14
SEPSECS-AS1
ARHGEF35-AS1
IL17D
STMN4
DEPDC4
GINS1
MRTFA
MUC5B-AS1
LRG1
AXL
MCOLN3
OR2A9P
TNFRSF10B
MELTF
PTH1R
ZNF264
RTL8B
MIR6830
DTNA
PKD1P6
OPLAH
FGD2
SUMO3
IGHE
ANXA2
CDYL
LINC01615
MRPL12
ASPM
CDC6
GTSE1
IFNAR2
FAS
UMODL1
SH3RF2
DIPK2A
E2F1
CORO1C
CDC42EP2
RUNX2
CCL22
MDK
MIR4743
GRPEL2
PALM2AKAP2
RAB37
SVIL
MAP7D2
PPP2CA-DT
NAGS
EMID1
C1QTNF7-AS1
GREB1
RNF41
NUDT1
SOX11
IFRD1
PPP1CB
CDH11
MIR761
ZBTB20-AS1
ZDHHC9
PDGFC
ADPRH
CPLANE2
RNU6-8
CYBA
TMCO3
RFX3-AS1
S1PR5
PKD2
FTH1P11
GOLGA2P5
ZNF610
MIR3198-2
DSCAM
SMARCE1P5
LIF
CAVIN2-AS1
LINC00526
CHML
SPTBN4
LINC00598
LNC-LBCS
C12orf60
CLGN
ARL2BPP4
KCTD11
CXCR4
ASPH
KIF4A
SKA3
HS3ST1
C19orf38
GRIN2C
CDKL2
SPRR1B
CENPX
DRAIC
NCMAP-DT
PAOX
YBX2
SEPTIN11
FCHO2-DT
LNX2
ZRANB1
NEK9
CEP19
LPAR3
NR3C1
WEE2
STMN1
OTOS
MIF4GD
NPEPPSP1
FAM177B
SIPA1L2
TMEM105
LINC02889
ANKRD22
PXDC1
GAMT
ISM2
TMPRSS9
FTH1P2
ARHGEF34P
GDAP1
NF2
SPRED1
BTC
TRIM60P18
MEX3D
IFI16
GDPD3
NAV2
MIR636
HSD17B14
CLPSL1
KCNJ8
GSC
PCAT7
LINC00636
PRRC1
HSH2D
TIMELESS
CREB5
TRAV18
PHC2-AS1
PTGFRN
PRELID1
SEMA6C
PAG1
OR7E39P
GLT1D1
AGBL2
FAM178B
ST13P6
LHX2
ZNNT1
HSPB1P1
CORO1A-AS1
THRIL
SNRPGP15
C2CD4C
DDX59
NPY5R
FYB2
MAP1A
COL13A1
ID4
IL12A-AS1
TAGAP-AS1
LINC00824
GOLGA5
GCNT3
OR7E126P
FDX2
KCTD17
PRICKLE2-DT
GBX2
EDARADD
IL20
FAM230I
MIR6785
RPL7P6
NUSAP1
CMKLR2
LRRC3
MAF
C14orf132
TNIK
DINOL
DNAH10OS
ARIH1
FGF13
RPL7P47
SWAP70
HS6ST2
LINC01977
LINC00629
LINC00866
MIR6765
ZNF304
PEX5
THRSP
FTH1P5
CDKN1A
STAB1
PHGDH
LINC01340
MCM7
ALOX5
ZMYM5
DCLK2
ECPAS
ABHD4
RPL4P6
FGFR4
KLKP1
SUMO2P17
ARHGAP22
P4HA3-AS1
SCGB1D2
SPATA6
SMU1P1
RSL1D1
ZNF460
MIDEAS
SND1-IT1
ACKR2
SUMO2P21
ANKRD34A
CAD
ZMAT1
TDRD12
TRBV30
RAC3
SULT2B1
C11orf98
ZNF841
P3H2
GJB5
SNAP91
HDLBP
NQO2-AS1
ANKRD1
CCDC80
KY
SPINK8
IL6R
PCDH20
ACTG1P20
RBP1
SPTLC3
GAPDHP38
OIP5
DNAJB6P2
SERPINB5
DHRS7
ESCO2
MIR4737
GATA5
NCAPH
CLSPN
MIR6833
PPP2R2A
MIR4428
CDH13
GAPDH-DT
RNF157
GJA3
TMTC1
ZNF853
GATA2-AS1
ATAD5
MIR4793
ZNF710
COL4A3
FTH1P10
PPFIBP2
TMPRSS13
AFAP1-AS1
NEK2
ANK1
SNORD35B
BTG3-AS1
MIR6730
BMP6
ZDHHC11B
MARK3
NCOR2
CALM2P2
ADAM20P1
IL18
SCHLAP1
CDH16
ZBTB20
LINC02343
ZNF697
OXER1
CCDC148-AS1
EIF2S2P3
ZNF654
KLHDC8B
EN2
EFNB1
ALDOC
HGH1
SNORD69
INTS4P1
NDUFB8P2
NBEAP5
MBOAT7
ACSBG1
LINC01016
EIF4H
LINC01529
FGD3
FAM83G
RRAS
STX17-DT
UBASH3B
CCDC137
HLF
PPP1R9A
IRF2-DT
CAPN8
DLX5
PTGES
KCNIP4
OXR1-AS1
LHX6
PIGW
VN1R48P
MIR6865
FEM1B
EMILIN3
MIR4640
IL17C
MIR6866
RNF122
LINC02656
ZNF295-AS1
SLC25A5
CCDC175
C7orf61
RASGEF1C
ABCC4
EMP1
CACNA1C
FBXL7
TFF2
SRD5A3
KRT87P
PLEKHB1
MANCR
GCHFR
HBEGF
DMRT1
TOMM40P1
GPR132
SNORD56
CNIH2
ALDH3A1
P2RX2
NKPD1
HEBP2
S1PR4
PRAP1
PCSK5
EFCAB6-DT
GPAA1
MT-TS2
IRX4
GUCY2C
SORCS1
ZFP69B
OR7E36P
SLC4A8
LARGE2
RACGAP1
FAM83E
LAPTM5
GABARAPL1
AFF3
KCNN3
SMPD5
OTOAP1
PPP1R14BP2
NEIL3
LINGO3
SPX
VCP
TMEM51-AS1
SMOC2
GATD3A
SFXN5
MIR6775
AGPAT4
ZNF333
CSRP2
NUGGC
RPL23AP49
ACRV1
ANTKMT
ATP6V1D
TCIRG1
CCDC87
NPIPB2
ELAC2
EIF4A1P5
KRT23
RACK1P1
MSLNL
HPGD
ADGRE2
USH1G
DLEU2L
SHLD1
EIF4BP5
TRPC6
SNORD62B
LINC01176
KCNJ3
CSF1
TSPAN13
CDKN2C
MASP1
MIR4751
PVRIG
LINC01164
FRG1HP
PLAGL1
CASC15
LCN2
PLA2G2A
THUMPD1P1
PLAAT4
RAB11FIP5
NDUFA13
NEDD9
NT5DC4
YWHAZP5
SOWAHA
PNMA6B
TRAV19
LKAAEAR1
ARMT1
LRRC10B
EEF1A1P22
LRAT
MARCKS
GCSHP5
SNORA10
CBR1
KRTAP5-1
MIR6891
DLGAP3
FGR
GSTA4
C3
SOCS3-DT
PSPC1-AS2
ALDH1L1
DSG2-AS1
TNFSF4
WNT3
ZNF135
AMD1
FAM184A
SEC1P
NECTIN4
LINC00160
CR2
CD68
SFTPA2
SNORA77B
MAB21L4
CTAGE15
PLAC9P1
SLC8A1-AS1
ANKRD17-DT
TRIL
EGFLAM
MIR6741
TUBB1
KCNK12
RUNX2-AS1
CLMN
VEPH1
ATP5MF
LINC01714
TPBGL-AS1
ADH6
RGL1
CASC19
DNAH10
RN7SK
UBE2L4
ARMC7
ADGRG5
DLGAP4-AS1
PHETA2
APLP2
GATA4
GTF2IP7
LMCD1
SNF8
TTC9-DT
FGFBP3
FAM91A2P
CDK18
CLUHP10
SPINK14
PTPDC1
DTX4
GSTM3P2
LDHAP1
SNORA12
NTF4
GAPDHP52
NUS1P2
CCT5P1
PRKCD
BHLHA15
RAET1L
LINC01732
PHC2
COLEC10
RASSF2
DSCC1
PGM5P2
ATP5PDP4
TENT4A
PPIC
HAAO
FOXRED2
LINC01918
SYT5
LINC01290
POU2F2
KCNJ18
KIZ-AS1
MIR339
SVIL2P
APBA1
RETN
ZNF337-AS1
TMEFF1
LINC02716
SERPINE1
MYLK3
ANO1-AS1
DBF4B
ASRGL1
USP30
SNX25P1
CYYR1-AS1
ADAM20
CEACAM7
SMARCD2
FAT2
ZNF732
ASTL
FRMD6
TNFAIP3
TRAF6
C1RL
LINC02428
LINC00173
PLEKHA2
SPIN1
BMP1
LINC01275
PDE6D
ACSM3
FBXL4
VWA5A
SHANK3
KRT19P1
TUBAP2
RPS3AP27
SYNGR1
MED28-DT
MRAP
MT-TM
LINC01517
RLIMP1
ERVE-1
RNU6-438P
MEF2C
INTU
ZNF285B
STK19B
C6orf58
LINC02352
C21orf62-AS1
AP1B1
VPS13B-DT
IFIT2
KANK3
TTC9B
FAM171A1
CNN2P9
CCNO-DT
DHRS9
PSMG3
DSG1-AS1
HKDC1
PEG13
HAS2-AS1
NEU1
CLIP3
OR11H13P
CCR8
GP2
PLCL2-AS1
ZNF133-AS1
LTB4R2
SNTN
CHSY3
TBC1D24
TENM4
GALNT6
GAL3ST1
TIGD2
USP2-AS1
CYCSP38
MIR3064
NR4A3
LINC01132
CDA
ACVR1
CES5AP1
GRM1
SHMT1P1
RMI2
IL12A
ELL2P1
ABCC1
LCMT2
LINC00957
EPHA8
PDAP1
MRPS7
SNX31
IGFBP5
RPL35AP16
PCDH12
GRK6P1
UPK1B
GAPDHP26
AFAP1L1
RPS10P7
MARK3P3
MARCHF1
RFX3
HNRNPRP1
TENM3
GSG1
TRAPPC1
GAPDHP45
EIF1P3
RNU6-914P
PRDX3P1
CGNL1
TSPAN18
CHKB-DT
LBX2
DNAH3
PRR22
ATP4B
DNMT1
AKR1C3
LINC00705
CRHR2
MRPL23-AS1
MIR4658
CLIP2
RXRG
SNX18
GGT5
NEDD8
MIR6875
VGF
CCDC9B
NACA
AARS1
IGHG2
ZBTB32
DLL3
ZRANB2-AS2
LAMB2P1
HLA-J
DACH1
TOR3A
ICAM3
PFDN4
DUOX1
MPPED2
HABP2
NRAP
KAT6B
ENHO
GBAP1
ANGPT4
EBF3
MAPK6P4
MLXP1
GRIK5
ZMAT3
CEACAM8
SEMA6D
PDZK1P1
SMIM10L2B-AS1
GALNT5
LIPK
CICP4
AMER2
SPRY3
FAR2P2
FAM219A
ZFP2
DPF3
SCGB1B2P
PRDM11
RPL34P18
ADRB2
ACE
WNT11
LINC01143
KCND1
DENND5A
CNTNAP5
KIF20A
KNTC1
SNORD35A
UCA1
FEM1C
ERICH2
BRI3
TBX15
NEURL2
LCP2
KCTD21-AS1
POFUT2
UBA52P7
DSN1
RSRC2
PARP6
GOLGA6L4
RPL22P2
SEMA5B
HS3ST5
ABHD6
CSPG4P12
MVD
SPEF1
ZBTB8OSP2
TIPARP
KIF18A
CD2AP
MIR193A
SNTG2-AS1
POTEJ
TCIM
HCG4P8
GFI1
RNF165
SRA1
ZNF725P
PLA2G4F
TMEM156
FRG1EP
SHH
CD3E
LINC00501
ZNF723
FTH1P13
SCGB2A2
PCDHA4
FLT1
RASA4CP
SLITRK4
SDHDP6
SNORD117
SETP10
SNORA9
PDE6B
MAML2
HOTTIP
IFIT1
SYT3
PEX11G
WNT9A
LBP
PAFAH1B2P1
CNTN3
RCAN2
SEC62-AS1
DISP2
COX7A2P2
SIAH2-AS1
CKS1BP1
SPRY2
PC
MIR6814
OR51B5
NR3C2
ORC1
RPL12P13
SOWAHD
RPF2P1
FTH1P23
GAPDHP28
TSFM
PSMC5
ITGA2B
ZNF17
CCDC40
MIR6876
GLRX3P2
PTGER3
CREB3L2
SH3BP1
FNDC4
TLE2
TGM1
PCDH8
PDZD2
GTF3C6P2
UBE2CP4
ADCY7
VTN
LENG9
BNIP3P10
KIAA0930
FAUP2
CEMP1
ZC3H6
BNIP3P11
PDPN
CTNNA1P1
LY96
RPSAP14
WBP1LP2
RNU6-1055P
NIM1K
GPR87
MIR6510
RPL23AP8
MIR936
FZD9
ZNF74
USP8P1
KLB
KAT5
LINC01772
CLDND2
GPD1
ALDH2
TUFMP1
IRF1-AS1
GATA3-AS1
ANKRD49P2
ACACB
COL5A3
KCNMB1
RPL21P8
AGAP1-IT1
ZNF727
RPGRIP1
LINC00519
DSEL-AS1
PCDHAC1
MAP6
MYT1
MED10
PHF24
SLC30A6-DT
MMP1
LINC02485
PGAM4
PITPNM3
AOX2P
RAET1E-AS1
LINC00323
SAV1
MRTFA-AS1
RNU6-436P
FBXO30-DT
PLCB2
PLEKHH2
RPL32P20
CNNM1
HECW2-AS1
HOPX
RPL17P36
RPL39P3
RASSF4
LINC01637
ZNF793
MIR6763
MMP2
LINC00365
ESR1
WNT5A-AS1
LINC01409
PTMAP12
KCTD12
TMEM171
RPL21P89
MCF2
LINC01094
KCNV2
OR1L8
RAMP2-AS1
PRSS3
SLAMF8
PDE4C
SLC17A5
SEPTIN9-DT
SNCA
FOXI1
SMILR
PTPN21
EEF1A1P9
SMIM35
PCSK9
PTCHD3
SH3TC2-DT
CCDC106
CEL
TMEM230P1
S100A8
MT1E
GABARAPL3
RASSF10-DT
PTBP1P
PAICSP1
LINC00539
SCARNA12
DSG4
TCN1
ROR2
WDR62
LINC00276
USP54
HNRNPM
EPX
IL2RG
TP73
PSMD10P2
LINC01152
NSG2
PRSS21
LINC00239
ZNF625
C1orf158
PSMB8
SNRPCP3
CD101
PBK
LINC01697
NACAP2
SLC25A24P1
CDC42P5
MAST1
RPL7P44
LHFPL6
WWOX
RPS27AP6
RNA5SP260
CCL28
MIR583HG
IL6-AS1
C16orf86
MYO3B
ZXDB
CNGB1
TMSB10P1
OR2A42
MIR937
SLC25A38P1
IMPDH1P9
TMEM229A
KLHDC7B
MECP2
NAV2-AS2
C11orf94
MIR3654
ZNF804B
SH3BP4
MFF-DT
BRPF3-AS1
PARVB
RDM1
LGALS1
SETP8
BHMT
MIX23P5
CCDC60
TBXA2R
LINC02157
LINC00115
HEATR4
TPT1P6
CCDC17
IL17RD
ACTG1P15
LINC00894
DYRK3
RNF157-AS1
TTC3-AS1
RSAD2
RPL15P2
ANAPC10P1
MIF4GD-DT
ZBED6CL
MED14OS
PTMAP11
MZB1
RSKR
ZNF551
GPAT4-AS1
CKMT2
MIR3918
RSL24D1
SNX19P3
SQLE-DT
LINC01424
GPRC5D-AS1
SMCR5
GAPDHP2
ZNF702P
FKBP6
LINC01535
TROAP
NAA20P1
EEF1DP8
SAP18P2
ZNF391
MIR27B
LINC01356
RPS2P24
KRT6A
TF
BIRC3
NOXO1
EPHX4
CPB2-AS1
FOXB1
CCDC184
DSCAML1
MIR7706
LINC01892
MIR6746
IGSF9B
GLYCTK-AS1
GAB2
TAF7L
HMSD
GFRA3
PAEP
LINC01285
GSEC
IDSP1
HNF1A
PDZD4
F2R
MARCHF5
UNC93B3
FAM124A
ARMC10P1
SUGT1P4
CRYM
TAS2R31
ST13P15
ARL5AP5
PTP4A1P4
HS3ST3A1
RNVU1-19
SV2C
SOHLH1
MAPRE2
ACTG2
SFMBT2
HYI
SCX
RPL24P2
PTX3
KIF21B
MIR4434
CCNYL7
RPL7P8
RNA5SP221
LINC01425
CHRFAM7A
NHLRC1
WNT4
SF3B4P1
NBEAP6
RPSAP26
MIR215
MEX3B
LETR1
ZSCAN18
PRDM16
MAST3
EEF1A1P12
PRKG2
IL1R2
FANCE
CDH5
RHOT1P3
MTRNR2L8
XIAPP1
BRI3BP
DPYSL5
CDCA3
EPAS1
LINC02506
MYADM
CRMP1
ARHGAP42-AS1
ACTG1P9
CFHR5
SUSD3
OR8B10P
NT5CP1
POU5F1B
PRNCR1
MIR4740
SRP9P1
DYSF
ATP5MKP1
TUBB2BP1
ADAM29
EHD4-AS1
ZFHX2
AGXT
PLAC4
NPM1P46
CRISPLD1
HOXA5
TNFRSF14
MIR21
EID2B
ADTRP
CIT
RAB42
PTPRB
SDSL
RN7SL535P
ZNF114-AS1
PTTG3P
MMP11
KRT8P1
ERVK-28
NEAT1
FDPSP4
RPS6KA6
RBM22P2
ITPRIP
LINC02680
C1orf216
FDPSP7
PTPRD
RN7SL659P
MIR3190
RNU6-163P
C21orf62
SEC14L1P1
ADRA1B
RTEL1
TTC23L-AS1
GLS2
CALN1
TGM3
CCN6
ZNF577
WDR77
RPL21P44
PTPRM
SOSTDC1
SYDE1
PRDX2P1
KANSL1L-AS1
BPIFA4P
FAM95C
SOBP
LINC00621
STAB2
BACE2
MIR3187
EMSLR
LINC02318
DUTP6
UBE2R2-AS1
SLC7A1
FRG1-DT
ADGRD1
RNA5SP343
MAG
ZNF25
MIR5196
MIR6834
PNMT
RPL23AP52
RPL35AP2
SNORA25
TRAF6P1
HIGD1AP14
ARMH1
DLGAP4
LINC01508
SCUBE1
LRMDA
CDC20P1
FBXL2
OR7E29P
RNU6-780P
FCF1P1
GLRB
ALG8
IL6
CAVIN3
MLPH
LINC02178
POTEF
LINC00572
ATOH8
NLGN1
HORMAD2-AS1
EMILIN2
NLRP2B
SHBG
FUT5
GJA1P1
PIEZO2
SPINK2
SLC12A8
CAPN9
MYCL
DDX3Y
SAMSN1
CFTR
GPR161
KRT17P6
TOMM20L-DT
KCNG2
TEX44
CDK8P1
HCG4B
ATP6V1E1P1
ASB14
FRG1KP
ANKRD7
ATP5PBP2
ASS1P8
MIAT
MN1
BMPR1B
AOX1
CHP1P3
ZNF462
PTPRVP
DNAI4
ACAD8
SNORA60
ALG1L13P
CATSPERE
EIF4A2P1
GAPDHS
CMAHP
KLK10
RN7SKP30
LINC00350
SLC35E1
IFITM3P2
ABCA10
LHX1
MIR1260B
CYP2C8
PGAM1P7
BRAFP1
ITGA9
CRB2
CHRNA7
RPS15AP12
NUP50P1
ARHGEF35
MAP3K7CL
KPNA4P1
HYKK
FCGR2B
TRIML2
TNRC6B-DT
UBR5-DT
TMEM130
SOX21-AS1
BMS1P22
TLR3
RPL13AP23
LINC02226
RAB28P5
BDKRB2
RN7SL130P
FRG1FP
CHKA-DT
RNU4-22P
NDUFB2
NDUFAB1P1
TEX53
SLC25A48
ABCB4
KRTAP10-2
HRH1
RPL6P25
RBM22P4
EGFLAM-AS1
PPP1R2B
CYCSP24
GABPB1
RNU6-957P
RAD21P1
ROM1
IGHG4
PDCD6IPP2
SALL2
CPP
ELOVL3
ADAMTS6
FAM3B
COX20P2
MTND5P26
NASPP1
LINC00589
ZNF132-DT
EYS
RPS19P7
PTGES2
LINC02600
MRPS11
PRKCZ-AS1
PLEKHO2
MIR16-1
MTATP8P1
DNAAF4
ABI1
SEPHS2
UGP2
SUSD2
TSSK2
MIR6823
CARS1
CAMP
SERPINA6
BDKRB1
LINC00845
TMEM178A
APBA2
IBSP
RN7SKP56
CTBP2P3
ISM1-AS1
RPL12P28
FGF7
ADGRG3
NEXMIF
RNU6-319P
SPATA4
NBPF20
RPL36P4
GPC2
ABLIM1
JPH1
MIR3960
OR5M3
ST8SIA6
LINC02641
ARF1P1
NPM1P24
MIR6838
IGHEP1
CTRB2
MYLK-AS1
VPS26BP1
MYOG
FBN1
SRSF3P5
RAP1AP
CROCCP4
SPDYE21
FOXN1
ATP5PBP7
TPI1P4
ZBTB39
FAM183A
ADH4
PLA2G1B
ELN
GNE
EEF1A1P29
RPL22P24
CD207
MIR146B
LINC02280
LINC02055
PLP1
MIR4482
MRPS5P3
LINC02888
TRAV29DV5
CATSPERZ
HMGA2-AS1
TINAGL1
MIR6506
LCE1B
BCAP31P2
COX5AP2
MIR1279
CSRP3-AS1
LINC02012
MIR6779
TRBV20OR9-2
RPL8P2
OPN3
HCAR2
VSIG1
LDLRAD4-AS1
TDRP
LIPE
MIX23P3
TSPY26P
GLULP4
SCHIP1
MTMR9LP
CCNI2
CLPS
DLGAP5
TOLLIP-DT
SMIM6
EDA
LINC01686
ADAMTS7
SMCO2
RN7SKP116
H1-12P
KLF7P1
FNTAP1
MIR3609
LINC02518
NAV2-AS3
RASA3-IT1
MTX3
OR8A3P
MPC1-DT
ZNF827
LINC00634
BMS1P15
YWHAZP2
HAL
RPL3P8
PRTN3
PDE10A
TTLL1-AS1
UMODL1-AS1
OR10D3
RPS4XP8
ARHGAP29
SH2D5
COPS8P2
MIR6075
RPS26P41
KCNG4
CEP126
MGAT4EP
SLC2A3P4
MKI67
TMPRSS7
RNA5SP283
KCNJ6
PROKR1
YPEL5P2
MSN
RN7SL431P
SPEF2
TGIF1P1
AKAP12
GRM6
SLC6A16
CHRNE
RPL18AP15
GATA6-AS1
BACH1-IT1
LINC01441
CAMK2D
LINC01134
SLC5A5
MAFTRR
HMGN1P35
GPR37L1
MIR6844
NELL1
GJA1
MRAP-AS1
MESP2
ALMS1-IT1
GRXCR2
SPIRE1
GSTP1
CYP4Z1
KRT8P43
SLC52A3
CBX5P1
MIR4690
TSSK3
TXNP4
FOXD2-AS1
DAPK1
C16orf92
PLCD4
TCEAL8
PPIL1
MANBA
LINC01747
DNM3
PRICKLE2-AS3
CCDC110
HOMER2
NPIPA9
MIR6790
TMSB15B-AS1
IFI6
ZNF419
SYT11
LINC02851
SNTG1
HCLS1
UBASH3A
OR8G5
HLA-DQB2
KCTD5P1
GSDMD
NRN1L
GAB3
EIF3IP1
RNF222
SLC22A13
CLRN1-AS1
GNG10P1
HSP90AA4P
CDHR4
EXTL3-AS1
PSMC1P8
MIR5188
P2RY1
EIF3LP1
TMTC2
KLF3P1
F7
SV2B
OR8T1P
RNF20
ANKRD11P2
DDX59-AS1
OPN1SW
LINC01366
NLRP3P1
LINC00534
SEPTIN7P8
PHBP7
RNU6-883P
GAPDHP67
RRN3P2
CHI3L1
OXCT1
MFAP4
BET1
RPS2P2
HYI-AS1
IDH1-AS1
PINCR
PAQR8
ZNF460-AS1
MIRLET7F1
PSMC1P11
H2BC18
ALDH1A3-AS1
GAPDHP48
ZNF649
PHF2P2
PPARGC1A
ANP32BP1
ADAMTS2
RNU6-418P
MAP3K2-DT
AATBC
RNA5SP439
HMGN2P38
FAM3D
RTCA-AS1
HIC2
UGT1A12P
FHAD1
PCOLCE2
LINC00858
HS3ST6
MAPK8IP2
TAPT1-AS1
SLC1A6
LINC00664
RPL21P41
INPP5J
SCARNA3
MTND4LP30
HLA-DRB9
STX7
PRB3
VDAC1P7
TONSL-AS1
TLR6
SF3A3P1
SHOX2
MIR637
LINC01397
OR8B2
RN7SL743P
MIR193B
HAUS6P1
PTGS1
ZNF320
LINC00266-1
MRPS31P2
SF3A3P2
LEFTY1
SYNPR-AS1
RN7SL164P
ALOX12B
MIR421
MT-TV
HERC2P3
CNN2P12
DNAI3
IMPDH1P2
MIR4523
MIR4675
SNORD34
RPS23P1
HENMT1
GNRH1
C5AR2
ARX
LUADT1
RPS5P2
SLCO1A2
GDAP1L1
NADK2-AS1
SLC6A19
HBQ1
LRP1
HMGN2P10
PLAC1
ANKRD49P1
RPL36AP45
MIR6872
MAGEE1
CCDC200
CBX3P1
CALCB
LINP1
RPL32P16
PRL
PBX1-AS1
MTHFD2P7
FENDRR
FOXD3-AS1
RPL22P1
MIR193BHG
FNDC3CP
RNF213-AS1
ARHGEF18-AS1
ZNF221
EVX1
ROBO3
SNORA50A
RBMS1P1
GOLGA8H
MIR6836
LINC02895
GPR55
KRTAP1-3
TNNC2
APOB
PCNPP3
AFTPH-DT
ATP5F1EP2
EEF1A1P2
F8A3
HCG27
LINC02816
VN1R83P
BHLHE41
APLF
SERPINA4
MMP21
MACROD2-IT1
TMEM132E
LBX1-AS1
BNC2-AS1
OXGR1
HTR5A
RNU6-460P
GTF2IRD2P1
CHST9
ZBBX
LINC02019
NPR3
LINC01311
PRSS29P
KRT8P4
DSC1
KAT7P1
RNVU1-2A
ANO7L1
RPS26P15
PRKN
INSC
HPCAL4
CAHM
SLC12A4
COX6CP2
ZDHHC1
MBLAC1
CORO1A
MYL12BP2
CASS4
MTND4LP7
RN7SL89P
LINC00997
ZNF517
LRIG2
EPB41L4A-AS1
GUCY1B1
ACTR1AP1
PRRT4
LINC02443
ACTBP12
ANAPC1P2
PDE4DIPP7
NACA2
PRIM1
H2AZP1
ARHGAP26
TMEM145
KCNQ4
CCDC181
RPSAP6
RNA5SP437
MIR2110
RNFT1P3
SLC4A1
SNORD36B
MTND5P1
ADAM11
EDIL3-DT
ANKRD18B
TMPRSS11A
SMAD5
ZCCHC18
MBTPS1-DT
NRSN2-AS1
ZSCAN5C
DEFB1
DIAPH2-AS1
HOXB6
MIR4284
CFAP69
HNRNPA1P46
CCDC152
IL21R
IL21R-AS1
ANKRD20A19P
GRIA3
CCNJP2
CORO2B
MIR181B2
NOS2
THSD8
PTCH2
NIFKP4
NCMAP
ACTBP7
NME5
RNU6-1285P
TTC4P1
PMS2P11
FAM43B
GVINP1
MEF2C-AS1
MEGF10
FAM166C
PTCHD3P2
TRABD2B
KCNMB2
IGF1
RPL7P58
ROCR
VGLL1
ACTP1
BMP8A
ASTN2
LRFN2
CNTNAP3C
BCAS2P1
CICP13
LINC02463
ZNF658
TXNDC8
ABHD14A-ACY1
CDH17
DYNAP
LONRF3
LINC01091
PNPLA1
GCATP1
GNMT
SEC61G
SBK2
AOC2
TMEM169
ELAVL2
RTKN
CHID1
SLC4A1APP1
PICART1
PDC-AS1
CLDN14
SNORA63D
FBLN2
RPL23AP12
PDCL3P2
PTTG2
ADORA3
ARHGAP31
RNY3P15
DYNLT3P2
LIG1
ZFPM2-AS1
SELENOP
FBLN7
P2RX5
SPRY4
MIR6859-1
CSTA
JMY
HCAR3
CGB3
KRT18P6
USP51
WASIR1
ACER2P1
MIR365A
CSMD2
ENPP7P7
RNU4-78P
CHST1
LINC00648
LINC01361
IQCN
MIR7851
C1QTNF1
SPATA45
PLCL2
FAM114A1
GATA1
CTBP2P8
ATP13A4
RPS17P5
PPP1R2
FYB1
RBMXP3
RNU6-481P
C16orf96
CALM2P3
NEXN
ZXDA
TPRKBP2
DHX58
IL1A
C20orf144
C19orf71
MIR1234
SLC38A3
LINC02904
PPIAP31
RPL21P135
SASH1
U2AF1L5
NPAS2-AS1
RSPO1
POU3F2
C8orf74
FRMPD1
LINC00942
KRT18P40
MIR600
DSEL
RMDN2-AS1
RNU6-455P
AGGF1P1
GAPDHP24
MT1L
LINC01907
CD4
PZP
SMPD4P1
EPCAM-DT
UBE2Q2L
NCF2
PAX7
IPO8P1
CCDC160
AKR1B1
KCNH6
RPS4XP19
RPL22P16
LINC02615
BOD1L1
DUTP7
RPS29P7
INSL6
AQP7
MIR3189
EVPLL
SLC19A3
RPS3AP29
LEF1
RPS17P1
TRAV27
MSLN
TRIM34
ICMT
HAS2
SNORD38A
TNKS
LINC02694
STX8P1
ST6GALNAC4
NME2P2
ARPP21
GRASLND
PAX2
RFTN1
VSTM2A
CTRB1
SCARNA1
PIH1D2
FAM13C
PLPPR3
PRDX3P2
TMEM190
HMCN2
RNU6-1280P
KRTDAP
SNORA79B
PSMD7P1
PRKY
APOOP2
CCL26
YBX1P10
PTAFR
ZNF441
FAM87B
TUBAP4
S100A3
GNG8
TAS2R13
SERPINA9
PPIAP85
ZBTB46
RPL31P63
LYPLA2P1
BLZF2P
EXOC3L2
SLC2A7
GASAL1
CENPF
NKX2-1
C9orf57
OR6K4P
PDGFRB
CTSLP2
FOXQ1
SERHL2
CATSPER1
KLF2P1
PHF3
TG
CCL4L2
CNTNAP3B
LINC00955
MIR1825
GAPDHP23
RPL10AP2
RBMX2P3
C1QTNF3
PNPO
NFYCP2
PPIAP40
MUC4
XKR7
KCNQ2
KIAA1210
RPL32P6
TMEM266
GALNT15
RPS15AP6
ZNF532
MIR4720
RPL21P93
SHISAL2A
KRT18P56
SPSB3
JAM2
SUMO2P1
FOXP1-AS1
INCA1
C20orf27
NAT8B
SARM1
ST3GAL1-DT
SEC14L5
MAGEC3
SHLD2P3
HMGN1P8
COL4A2
LINC00460
MIR3139
MYO1G
LINC02595
C1QL1
MIR155
MYBPC1
CDCP1
SFTPA1
ABHD12B
MYO7A
RPL13AP2
POLG-DT
KLK4
SPINK5
SLC9A9
DIS3L-AS1
C5orf46
RPL19P20
CNTN2
TSPOAP1
LINC01338
TRPM2
LINC00167
FBXL19
LINC00840
NBEAP1
KCNT1
GUCA1A
GPHA2
SRMP2
NMD3P1
KIAA1217
CYP2T3P
AJAP1
APOBEC3B
SPAG16
BEAN1
OR7E22P
CYP3A7
CYP3A7-CYP3A51P
ZDHHC22
LINC02335
SLN
ITGA6
ENTPD8
FOXA3
OR52K3P
KRTAP9-12P
RPL36P2
RPS3AP26
TPBGL
SIRT4
LRRC4C
LINC01238
C22orf23
TPI1P2
LINC01186
RN7SL354P
CARNMT1-AS1
NMRK2
RCC2P6
ZNF571-AS1
SEPHS1P6
AP1M2P1
CDC42-IT1
UFM1P2
SCN3B
PKNOX2
APOBEC3G
IRAK2
GALNT16
AGO4
POTEG
LINC00626
WFDC3
MYOM1
CBX3P2
ZWINT
EEF1A1P1
OR10AC1
LIPM
RPL37P2
YPEL4
TCAF2C
PIGHP1
TBCAP1
MT-TG
C1GALT1C1L
BEX1
C1QL4
DUSP5-DT
KRT15
CMPK2
ADRA2B
CXCL8
COP1P1
SMYD3-AS1
ODF3
VSTM4
BTF3L4P1
ARMC3
SEMA7A
MIR1972-1
RNU2-27P
PRKCQ
RPL32P27
RNA5SP141
HLA-DMB
MIR3621
ITPRIP-AS1
P3H4
NCR3
LINC01228
LINC00494
ESYT3
EEF1A1P11
PTGIS
RSL24D1P1
CHMP5P1
EGR2
PTPRC
LINC01114
HOXD8
RNY1P15
KIAA0408
TFGP1
PPP4R1-AS1
ACTG1P3
LINC01933
CCL3
TUBBP2
FRMD5
SGCD
ARPP19P1
MIR6740
PEG10
HMGB1P3
RPSAP69
RSL24D1P6
SUMO2P6
MIR5006
TNIP1
SNHG28
RNA5SP37
RBM11
PRKAG2-AS1
RN7SL775P
IL11RA
LINC01305
ATP6V0E1P3
RN7SL4P
CRBN
MON1A
CCR2
SLC6A20
LINC02533
LINC01362
COL7A1
SNORD3B-1
DEPDC1P1
RASAL2-AS1
SNORD54
ACSM4
OR7E90P
H3P47
SETP22
VEGFD
GPBAR1
RN7SL466P
ABCB10
SCML2P1
ATP6V0E1P2
C1orf94
GCM2
SDR9C7
MAS1
FNDC7
NACAD
IFFO1
SPANXB1
PTMAP1
LINC02300
SRCIN1
OGFRP1
TMEM121B
CATSPER3
LINC01978
RPS8P4
EVI2B
HES7
ZFP37
ALDH3B1
MIR544B
RPL7P9
KLHL38
RNU1-134P
RN7SL443P
G0S2
SLC7A9
PCSK1
DIRAS3
MIR23A
FAM157A
UPK3A
SLC9A7P1
RHEX
FLNC
SNORA20
KRT8P27
UQCRBP2
DNAJC28
WWP1P1
SNORD52
CLLU1
MIR4513
DDX12P
HSPA2-AS1
CCND2-AS1
CCND2
RPL26P30
TNFAIP8
RGMA
ARHGAP44-AS1
MIR548O
MIR933
MIR6165
ENPP2
RNU7-40P
LINC02679
BRWD1-AS1
MIR34A
NOTO
SNORD70B
SEPTIN7P7
MYBL2
LRIG2-DT
RPP25
MIR30B
ZNF826P
RDM1P1
MIR6810
POLH-AS1
FZD1
RPL12P47
RPS7P14
RNU6-29P
C1GALT1
BZW1P2
RPL13AP7
PRAM1
EIF2S2P4
RBPMS2
SOX10
LINC00640
FAM133FP
FAM217A
LINC01068
LINC01864
MTATP8P2
ITGB1
HLA-DRB1
HSPA8P16
KLHDC7B-DT
ST18
LINC02223
COX6B1P4
HNRNPA1P47
NT5M
OR7E37P
MIS18A-AS1
LINC02269
SLC4A9
ADCY5
MYCNUT
IL17REL
IGHV4-34
MAD2L1-DT
H3P11
RPL31P7
NLRP3
IGSF22
HMGA1P7
KRT85
KCNC2
SLC25A27
LST1
CICP9
TNFAIP6
FGG
LYG2
FABP6-AS1
NOG
RP9
CLDN11
ANGPTL2
CSF3R
LINC01749
PRKAR2B-AS1
LINC00608
VAX1
RPL23AP35
CALCA
DBIL5P2
LYPLA1P3
MEAF6P1
ZMYND10
SLC8A3
DLG5-AS1
PDE1A
TRIM67
MEDAG
ITPRID1
YY2
RN7SL166P
UBE2S
TBPL2
CENPK
TMCO2
MMP10
KCTD9P1
WDHD1
SNORA73B
MEFV
PSMD8P1
YIPF7
MINAR2
ABCC6P2
ISOC2
TXNP5
PLAT
JAG1
LINC01185
TTYH2
CGB7
LINC02068
LINC01701
CALHM3
RPL37A-DT
ME3
CNTNAP3P1
ITGA6-AS1
PIGM
RPL7AP11
SERHL
LINC02052
NIFKP8
ACTN3
C20orf202
MAPK4
UROC1
OLFML2A
RN7SL253P
NFYBP1
HHIP-AS1
DKKL1
LINC00865

Unsupervised Analysis¶

In this section we explore the dataset using unsupervised learning techniques.
We use the Train dataset and carry out our analysis focusing on MCF7 - Smart-Seq, adding remarks and considerations on HCC1806 where relevant. Applying a logarithm to the data will be helpful for visualization.
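As a minimal sketch of the log transform mentioned above (the toy matrix below merely stands in for the real `data_train`, whose loading code is not shown here), `np.log1p` is a convenient choice because it maps zero counts to zero:

```python
import numpy as np
import pandas as pd

# Toy count matrix standing in for the real (cells x genes) `data_train`.
data_train = pd.DataFrame(
    [[0, 10, 1000], [5, 0, 20000]],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)

# log1p = log(1 + x) compresses the heavy right tail of count data
# while keeping zero counts at zero, which helps visualization.
data_train_log = np.log1p(data_train)
print(data_train_log.round(2))
```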

PCA¶

Reducing the dimensionality of the dataset is useful for several reasons:

  • for visualization purposes;
  • to understand how the variance is distributed among features;
  • to visualize the results of clustering and perform clustering on the highest variance features found with PCA.

Cells¶

We start by performing PCA on the cells: we reduce the number of genes (dimensions) while retaining enough principal components to explain 95% of the variance. PCA is performed on the original dataset, with no transformation applied.

In [ ]:
PCA_data = PCA(n_components=0.95)
data_train_red = PCA_data.fit_transform(data_train)
print("Number of components:", PCA_data.n_components_)
print('Explained variation per principal component: {}'.format(PCA_data.explained_variance_ratio_))
Number of components: 20
Explained variation per principal component: [0.6344835  0.09107496 0.06270155 0.04033215 0.03156572 0.01538185
 0.01138705 0.01009336 0.00898676 0.00758287 0.00614321 0.00492391
 0.00467293 0.00430685 0.00377886 0.00331298 0.00300167 0.0027689
 0.00266241 0.00228027]
In [ ]:
print("Reconstruction error:", mean_squared_error(PCA_data.inverse_transform(data_train_red), data_train))
Reconstruction error: 24488.13749442456

For the HCC1806 experiment the results are similar: there are 34 principal components instead of 20, and the reconstruction error is again quite high.

In [ ]:
PCA_data = PCA(n_components=0.95)
data_train_red = PCA_data.fit_transform(data_train)
print("Number of components:", PCA_data.n_components_)
print('Explained variation per principal component: {}'.format(PCA_data.explained_variance_ratio_))
Number of components: 34
Explained variation per principal component: [0.29018923 0.18101256 0.12288734 0.07970126 0.04956884 0.03640102
 0.02737402 0.02113149 0.01743474 0.01317286 0.01208516 0.0111941
 0.00880007 0.00783661 0.00748447 0.00693555 0.00579659 0.00509238
 0.00467221 0.00422117 0.00399608 0.00383594 0.00365575 0.00342266
 0.00304432 0.00285331 0.00257095 0.00235782 0.0022066  0.00215775
 0.00209672 0.00204413 0.00186566 0.00174111]
In [ ]:
print("Reconstruction error:", mean_squared_error(PCA_data.inverse_transform(data_train_red), data_train))
Reconstruction error: 17554.361399621102
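The raw MSE values are hard to judge on their own, because they depend on the scale of the data. For PCA, the reconstruction MSE corresponds to the variance left in the discarded components, so it can be put on a relative scale. A minimal sketch on synthetic data (the notebook's `data_train` is assumed unavailable here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic (cells x genes) matrix with heterogeneous per-feature scales.
X = rng.normal(size=(100, 30)) * rng.uniform(0.1, 5.0, size=30)

pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)
mse = mean_squared_error(pca.inverse_transform(X_red), X)

# Dividing by the total per-feature variance turns the absolute MSE into
# the fraction of variance lost, bounded by ~5% here by construction.
total_var = X.var(axis=0, ddof=1).sum()
relative_error = mse * X.shape[1] / total_var
print(f"MSE: {mse:.4f}  relative: {relative_error:.4f}")
```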

We would also like to see how much of the variance the first components capture, and how the cumulative explained variance grows as we increase the number of components.

In [ ]:
exp_var_pca = PCA_data.explained_variance_ratio_
v = len(PCA_data.explained_variance_ratio_)
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
legend = plt.legend(loc='best', frameon=True)
legend.get_frame().set_edgecolor('black')
plt.tight_layout()
plt.title("MCF7 - PCA")
plt.grid(visible=False)
plt.xticks([i for i in range(v)], [i+1 for i in range(v)])
plt.show()

It is interesting to see that the first component explains more than 60% of the variance, while the second explains far less. This is not the case for HCC1806, where the first component is responsible for only 29% of the variance but the drop to the second component is less dramatic.

In [ ]:
exp_var_pca = PCA_data.explained_variance_ratio_
v = len(PCA_data.explained_variance_ratio_)
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
legend = plt.legend(loc='best', frameon=True)
legend.get_frame().set_edgecolor('black')
plt.tight_layout()
plt.title("HCC1806 - PCA")
plt.grid(visible=False)
plt.xticks([i for i in range(v)], [i+1 for i in range(v)])
plt.show()

Let's visualize the first five components plotted against each other: we can see that the distribution of the cells is quite different between the two cell lines.

In [ ]:
n_components = 5
pca = PCA(n_components=n_components)
components = pca.fit_transform(data_train)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Condition'

fig = px.scatter_matrix(
    components,
    color=data_train_lab["Condition"],
    dimensions=range(n_components),
    labels=labels,
    title=f'MCF7 - Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()
In [ ]:
n_components = 5
pca = PCA(n_components=n_components)
components = pca.fit_transform(data_train)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Condition'

fig = px.scatter_matrix(
    components,
    color=data_train_lab["Condition"],
    dimensions=range(n_components),
    labels=labels,
    title=f'HCC1806 - Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()

For visualization purposes, we now set the number of components to 2 and then to 3. Starting from the reduced dataset, we plot each data point (cell) in green if it comes from the hypoxia condition and in red if it comes from the normoxia condition.

In 2D¶

In [ ]:
PCA2_data = PCA(n_components=2)
principalComponents_hcc2 = PCA2_data.fit_transform(data_train)
data_pr2 = pd.DataFrame(data = principalComponents_hcc2
             , columns = ['PC1', 'PC2'])
print('Explained variation per principal component: {}'.format(PCA2_data.explained_variance_ratio_)) 
Explained variation per principal component: [0.6344835  0.09107496]
In [ ]:
data_pr2_lab = data_pr2.copy()
# Vectorized mapping instead of the original element-wise loop, whose chained
# assignment (data_pr2_lab["Condition"][i] = ...) raised SettingWithCopyWarning.
data_pr2_lab["Condition"] = data_train_lab["Condition"].map({"Norm": 0, "Hypo": 1}).values
In [ ]:
x = np.array(data_pr2_lab['PC1'])
y = np.array(data_pr2_lab['PC2'])
plt.scatter(x, y, c=data_pr2_lab["Condition"], cmap="prism")
plt.title("MCF7 - PCA of cells")
plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()

For HCC1806:

In [ ]:
PCA2_data = PCA(n_components=2)
principalComponents_hcc2 = PCA2_data.fit_transform(data_train)
data_pr2 = pd.DataFrame(data = principalComponents_hcc2
             , columns = ['PC1', 'PC2'])
print('Explained variation per principal component: {}'.format(PCA2_data.explained_variance_ratio_)) 
Explained variation per principal component: [0.29018923 0.18101256]
In [ ]:
data_pr2_lab = data_pr2.copy()
# Vectorized mapping instead of the original element-wise loop, whose chained
# assignment (data_pr2_lab["Condition"][i] = ...) raised SettingWithCopyWarning.
data_pr2_lab["Condition"] = data_train_lab["Condition"].map({"Normo": 0, "Hypo": 1}).values

In [ ]:
x = np.array(data_pr2_lab['PC1'])
y = np.array(data_pr2_lab['PC2'])
plt.scatter(x, y, c=data_pr2_lab["Condition"], cmap="prism")
plt.title("HCC1806 - PCA of cells")
plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()

As already noticed, the cells are distributed quite differently in the two cell lines.
In 2D, normoxic and hypoxic cells of MCF7 appear clearly separated, while this is not the case for HCC1806.

In 3D¶

In [ ]:
PCA3_hcc = PCA(n_components=3)
principalComponents_hcc3 = PCA3_hcc.fit_transform(data_train)
data_pr3 = pd.DataFrame(data = principalComponents_hcc3
             , columns = ['PC1', 'PC2', 'PC3'])
print('Explained variation per principal component: {}'.format(PCA3_hcc.explained_variance_ratio_)) 
Explained variation per principal component: [0.6344835  0.09107496 0.06270155]
In [ ]:
data_pr3_lab = data_pr3.copy()
# Vectorized mapping instead of the original element-wise loop, whose chained
# assignment (data_pr3_lab["Condition"][i] = ...) raised SettingWithCopyWarning.
data_pr3_lab["Condition"] = data_train_lab["Condition"].map({"Norm": 0, "Hypo": 1}).values

In [ ]:
def PCA_3(EL, AZ):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(projection='3d')
     
    x = np.array(data_pr3_lab['PC1'])
    y = np.array(data_pr3_lab['PC2'])
    z = np.array(data_pr3_lab['PC3'])
    scatter = ax.scatter(x, y, z, c=data_pr3_lab["Condition"], cmap="prism") 

    labels = ["Normoxia", "Hypoxia"]
    legend_handles, legend_labels = scatter.legend_elements()
    legend = ax.legend(handles=legend_handles, labels=labels, loc='center left', bbox_to_anchor=(0, 0.8))

    ax.view_init(elev=EL, azim=AZ)
    print("Elevation:",EL," Azimuth:",AZ)
In [ ]:
PCA_3(20,120)
Elevation: 20  Azimuth: 120
In [ ]:
PCA3_hcc = PCA(n_components=3)
principalComponents_hcc3 = PCA3_hcc.fit_transform(data_train)
data_pr3 = pd.DataFrame(data = principalComponents_hcc3
             , columns = ['PC1', 'PC2', 'PC3'])
print('Explained variation per principal component: {}'.format(PCA3_hcc.explained_variance_ratio_)) 
Explained variation per principal component: [0.29018923 0.18101256 0.12288734]
In [ ]:
data_pr3_lab = data_pr3.copy()
# Vectorized mapping instead of the original element-wise loop, whose chained
# assignment (data_pr3_lab["Condition"][i] = ...) raised SettingWithCopyWarning.
data_pr3_lab["Condition"] = data_train_lab["Condition"].map({"Normo": 0, "Hypo": 1}).values

In [ ]:
def PCA_3(EL, AZ):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(projection='3d')
     
    x = np.array(data_pr3_lab['PC1'])
    y = np.array(data_pr3_lab['PC2'])
    z = np.array(data_pr3_lab['PC3'])
    scatter = ax.scatter(x, y, z, c=data_pr3_lab["Condition"], cmap="prism") 

    labels = ["Normoxia", "Hypoxia"]
    legend_handles, legend_labels = scatter.legend_elements()
    legend = ax.legend(handles=legend_handles, labels=labels, loc='center left', bbox_to_anchor=(0, 0.8))

    ax.view_init(elev=EL, azim=AZ)
    print("Elevation:",EL," Azimuth:",AZ)
In [ ]:
PCA_3(20,120)
Elevation: 20  Azimuth: 120

In 3D the cells of HCC1806 are better separated: this makes sense, as the variance explained by the third principal component (12%) is comparable to that explained by the second (18%), so the third component is also relevant.

Genes¶

PCA on genes is mainly done for visualization and, later, for clustering. We are not interested in reducing the dimensions per se, as this is not particularly relevant from a biological point of view.

In [ ]:
PCA_hcc_g = PCA(n_components=3)
pc_hcc_genes = PCA_hcc_g.fit_transform(data_genes)
data_pr3_g = pd.DataFrame(data = pc_hcc_genes
             , columns = ['PC1', 'PC2', 'PC3'])
print('Explained variation per principal component: {}'.format(PCA_hcc_g.explained_variance_ratio_)) 
Explained variation per principal component: [0.64301682 0.06518905 0.0184779 ]
In [ ]:
x = np.array(data_pr3_g['PC1'])
y = np.array(data_pr3_g['PC2'])
plt.scatter(x, y, c="green", s=20)
plt.title("MCF7")
plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()
In [ ]:
x = np.array(data_pr3_g['PC1'])
y = np.array(data_pr3_g['PC2'])
plt.scatter(x, y, c="green", s=20)
plt.title("HCC1806")
plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()

The plots are clearly similar for both cell lines.

Clustering¶

Clustering is a crucial tool for gaining insight into the datasets, especially when we have an enormous number of features and it is difficult to understand how the data is structured. Ideally, we would like to obtain 2 clusters that we can identify with cells cultivated in hypoxia and cells cultivated in normoxia.

The types of clustering used are:

  • K-means
  • Agglomerative
  • UMAP (only for HCC1806)

Clustering in full dimensions and visualization of the results with PCA¶

We start by doing the clustering in full dimensions and then plotting the clusters found with PCA.
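This step can be sketched as follows, on synthetic data standing in for `data_train` (two well-separated "conditions" in a high-dimensional space): clustering happens in the full-dimensional space, and PCA is used afterwards only for plotting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Two synthetic "conditions" in a 100-dimensional space.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 100)),
    rng.normal(3.0, 1.0, size=(50, 100)),
])

# Cluster in the full-dimensional space...
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# ...then project to 2D with PCA purely for visualization.
coords = PCA(n_components=2).fit_transform(X)
# The clusters can then be shown with e.g.:
# plt.scatter(coords[:, 0], coords[:, 1], c=km.labels_, cmap="prism")
print(coords.shape, np.bincount(km.labels_))
```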

K-Means¶

We try out some methods to determine the right number of clusters.

Elbow method¶

The elbow method is a heuristic that consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.

In [ ]:
fig, ax = plt.subplots()

visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2,7), ax=ax)
visualizer.fit(data_train)
ax.set_xticks(range(2,7))
visualizer.show()
plt.show()

Silhouette score¶

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays, in the range [-1, 1], a measure of how close each point in one cluster is to points in the neighboring clusters.
Coefficients close to +1 indicate that the sample is far from the neighboring clusters, a value around 0 indicates that the sample is very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. The silhouette score is the mean of these values.

In [ ]:
from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, 
                max_iter=300, 
                tol=1e-04, 
                init='k-means++', 
                n_init=10, 
                random_state=42, 
                algorithm='auto')
    km.fit(data_train)
    silhouette_scores.append(silhouette_score(data_train, km.labels_))

fig, ax = plt.subplots()
ax.plot(range(2, 7), silhouette_scores, color="black")
#ax.set_title('Silhouette Score Method')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette Scores')
plt.xticks(range(2, 7))
plt.tight_layout()
plt.show()

Silhouette analysis¶

In [ ]:
def silhouette_plot(X, model, ax, colors):
    y_lower = 10
    y_tick_pos_ = []
    sh_samples = silhouette_samples(X, model.labels_)
    sh_score = silhouette_score(X, model.labels_)
    
    for idx in range(model.n_clusters):
        values = sh_samples[model.labels_ == idx]
        values.sort()
        size = values.shape[0]
        y_upper = y_lower + size
        ax.fill_betweenx(np.arange(y_lower, y_upper),0,values,
                         facecolor=colors[idx],edgecolor=colors[idx]
        )
        y_tick_pos_.append(y_lower + 0.5 * size)
        y_lower = y_upper + 10

    ax.axvline(x=sh_score, color="red", linestyle="--", label="Avg Silhouette Score")
    ax.set_title("Silhouette Plot for {} clusters".format(model.n_clusters))
    l_xlim = max(-1, min(-0.1, round(min(sh_samples) - 0.1, 1)))
    u_xlim = min(1, round(max(sh_samples) + 0.1, 1))
    ax.set_xlim([l_xlim, u_xlim])
    ax.set_ylim([0, X.shape[0] + (model.n_clusters + 1) * 10])
    ax.set_xlabel("silhouette coefficient values")
    ax.set_ylabel("cluster label")
    ax.set_yticks(y_tick_pos_)
    ax.set_yticklabels(str(idx) for idx in range(model.n_clusters))
    ax.xaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.legend(loc="best")
    return ax

k_max = 7
ncols = 3
nrows = k_max // ncols + (k_max % ncols > 0)
fig = plt.figure(figsize=(15,15), dpi=200)

for k in range(2,k_max+1):
    
    km = KMeans(n_clusters=k, 
                max_iter=300, 
                tol=1e-04, 
                init='k-means++', 
                n_init=10, 
                random_state=42, 
                algorithm='auto')

    km_fit = km.fit(data_train)
    
    ax = plt.subplot(nrows, ncols, k-1)
    silhouette_plot(data_train, km_fit,ax, cluster_colors)

fig.suptitle("Silhouette plots", fontsize=18, y=1)
plt.tight_layout()
plt.show()

Analyzing these plots, we understand that the best choice of clusters should be 2. The elbow method also suggests that clustering with k=3 makes sense. This means that there may be a further division between cells in addition to the basic 'Hypoxia' and 'Normoxia'.

We also see that, for every choice of k, the largest cluster is the most clearly defined.

Let's proceed with clustering:

  • we perform the k-means clustering;
  • we plot the result on the principal components;
  • we perform a cluster diagnosis, analysing the cardinality and magnitude of each cluster.
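
The `diagnoses` helper used below is defined earlier in the notebook. As a rough sketch of what such a diagnosis computes, the following illustrative function (the name and the toy data are ours, not the notebook's) returns the cardinality of each cluster and its magnitude, measured as the total distance of the members from their centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_diagnosis(model, X):
    """Cardinality (points per cluster) and magnitude (sum of
    point-to-centroid distances) of a fitted KMeans model."""
    X = np.asarray(X)
    cardinality, magnitude = [], []
    for idx in range(model.n_clusters):
        members = X[model.labels_ == idx]
        cardinality.append(len(members))
        dists = np.linalg.norm(members - model.cluster_centers_[idx], axis=1)
        magnitude.append(dists.sum())
    return cardinality, magnitude

# toy usage on two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(8, 1, (30, 5))])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
card, mag = cluster_diagnosis(km, X)
print(sorted(card))  # -> [30, 50]
```

Plotting magnitude against cardinality is a common way to spot anomalous clusters: in a healthy clustering the two quantities are roughly proportional.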

2 clusters¶

In [ ]:
kmeans = KMeans(n_clusters=2, random_state=2352).fit(data_train)
kmeans.labels_
array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 1, 1, 1], dtype=int32)
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("2-means clustering")
plt.show()
In [ ]:
KM_plot(20, 120, kmeans)
Elevation: 20  Azimut: 120
In [ ]:
diagnoses(kmeans, data_train, cluster_colors)

3 clusters¶

In [ ]:
kmeans2 = KMeans(n_clusters=3, random_state=2352).fit(data_train)
kmeans2.labels_
array([1, 1, 1, 1, 0, 2, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 1, 1, 1,
       1, 0, 2, 2, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 2, 2, 0, 0, 2, 0, 1, 1, 1, 1, 1, 0, 0, 2, 0, 2, 0, 1, 1,
       1, 1, 1, 1, 2, 0, 2, 0, 0, 2, 1, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 2,
       1, 1, 1, 1, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 2, 2, 2, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 2, 1, 1, 1,
       1, 1, 1, 2, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 2, 2, 0, 2, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 2, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 2, 2, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 2], dtype=int32)
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans2.labels_, cmap=ListedColormap(cluster_colors[:3]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("3-means clustering")
plt.show()
In [ ]:
KM_plot(20, 120, kmeans2)
Elevation: 20  Azimut: 120
In [ ]:
diagnoses(kmeans2, data_train, cluster_colors)

With clustering we identify two main clusters. Comparing these plots with the PCA visualization, we see that the two clusters effectively divide cells into normoxic and hypoxic with high accuracy.

Let's quantify how good this division is by defining a clustering accuracy.

In [ ]:
def clustering_accuracy(clust_labels, og_labels):
    matches = np.count_nonzero(clust_labels == og_labels)
    acc = matches * 100 / len(og_labels)
    # the 0/1 cluster labelling is arbitrary, so the flipped labelling is equally valid
    print("Clustering accuracy:", max(acc, 100 - acc), "%")
In [ ]:
og_labels = data_pr2_lab["Condition"].values
clust_predict = kmeans.labels_
clustering_accuracy(og_labels, clust_predict)
Clustering accuracy: 97.2 %

Hence, K-means clustering distinguishes the two conditions with 97.2% accuracy, measured as the number of correct classifications divided by the total number of samples.
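
Taking the maximum of the accuracy and its complement, as `clustering_accuracy` does, only works with two clusters, whose labels can at most be flipped. For completeness, a hedged sketch of a generalization to any number of clusters, which matches cluster labels to class labels with the Hungarian algorithm (the function name and usage are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(clust_labels, og_labels):
    """Accuracy under the best cluster-to-class assignment."""
    clust_labels = np.asarray(clust_labels)
    og_labels = np.asarray(og_labels)
    clusters = np.unique(clust_labels)
    classes = np.unique(og_labels)
    # contingency table: rows = clusters, columns = true classes
    cont = np.array([[np.sum((clust_labels == cu) & (og_labels == cl))
                      for cl in classes] for cu in clusters])
    rows, cols = linear_sum_assignment(-cont)  # maximise total agreement
    return cont[rows, cols].sum() / len(og_labels)

# flipped labels still score perfectly
print(matched_accuracy([1, 1, 0, 0], [0, 0, 1, 1]))  # -> 1.0
```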

Doing a 3-means clustering, we notice that we can still see the normoxic cluster detected by the 2-means clustering, while the cluster corresponding to hypoxic cells is split in two. One interpretation is that there are two subclasses of hypoxic cells, possibly related to different levels of oxygen supply (the blue cluster could be the cells with less oxygen) or to other factors that should be discussed with a domain expert.

Agglomerative Clustering¶

We perform agglomerative clustering using the standard Euclidean distance, which fits the task of measuring distances between cells well, and the Ward linkage. Other linkages (single, average and complete) were also tried, but the results were much worse.
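
The linkage comparison described above can be reproduced with a short loop; the sketch below uses synthetic blobs and is purely illustrative (on well-separated toy data all linkages happen to perform well, unlike on our expression data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# two well-separated synthetic "conditions" standing in for Hypoxia/Normoxia
X, y = make_blobs(n_samples=100, centers=[[0, 0], [10, 10]],
                  cluster_std=1.0, random_state=42)

results = {}
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    agree = np.mean(labels == y)
    results[linkage] = max(agree, 1 - agree)  # cluster labels may be flipped
    print(f"{linkage:>8}: {results[linkage]:.2%}")
```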

In [ ]:
agglomerative = AgglomerativeClustering().fit(data_train)
agglomerative.labels_
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0])
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agglomerative.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
In [ ]:
AG_plot(20,120)
In [ ]:
agg_predict = agglomerative.labels_
clustering_accuracy(og_labels, agg_predict)
Clustering accuracy: 98.0 %

We can also plot the results in a dendrogram:

In [ ]:
plot_dendrogram(agglomerative)
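
`plot_dendrogram` is a custom helper defined earlier in the notebook. A sketch in the spirit of the scikit-learn documentation example, which assembles a SciPy linkage matrix from the fitted model's `children_` and `distances_` attributes (the model must be fitted with `distance_threshold=0, n_clusters=None`, or with `compute_distances=True`, for `distances_` to be available):

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

def linkage_matrix(model):
    """Build a SciPy-style linkage matrix from a fitted
    AgglomerativeClustering model (requires `distances_`)."""
    n_samples = len(model.labels_)
    # number of original samples under each merged node
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                        for child in merge)
    return np.column_stack(
        [model.children_, model.distances_, counts]).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)
Z = linkage_matrix(model)
print(Z.shape)  # -> (19, 4)
# hierarchy.dendrogram(Z)  # with matplotlib, draws the tree
```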

The agglomerative clustering seems to confirm the results of the k-means clustering: the accuracy is 98%.

Now we can proceed by performing clustering on the space defined by the first 2 and 3 principal components. We start by performing agglomerative clustering.

Clustering of principal components¶

Agglomerative Clustering¶

In [ ]:
agg_PC2 = AgglomerativeClustering().fit(data_pr2)
agg_PC2.labels_
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0])
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agg_PC2.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
In [ ]:
aggPC2_predict = agg_PC2.labels_
clustering_accuracy(og_labels, aggPC2_predict)
Clustering accuracy: 99.2 %
In [ ]:
agg_PC3 = AgglomerativeClustering().fit(data_pr3)
agg_PC3.labels_
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0])
In [ ]:
aggPC3_predict = agg_PC3.labels_
clustering_accuracy(og_labels, aggPC3_predict)
Clustering accuracy: 98.4 %
In [ ]:
AGPC_plot_int(20,120)

We can see that the clustering accuracy is even higher, in both the two- and three-dimensional spaces of principal components. Since the accuracy is already high, we do not perform k-means on the space of principal components.

Now we move on to clustering for HCC1806, where the results are quite different. In particular, we will see that the resulting clusters do not resemble the division into hypoxic and normoxic groups previously visualized with PCA. The methods, techniques and analyses are similar to those used for MCF7.

K-Means¶

Elbow method¶

In [ ]:
fig, ax = plt.subplots()

visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2,7), ax=ax)
visualizer.fit(data_train)
ax.set_xticks(range(2,7))
visualizer.show()
plt.show()
/Users/emanuelemarinolibrandi/opt/anaconda3/lib/python3.9/site-packages/yellowbrick/utils/kneed.py:156: YellowbrickWarning:

No 'knee' or 'elbow point' detected This could be due to bad clustering, no actual clusters being formed etc.

/Users/emanuelemarinolibrandi/opt/anaconda3/lib/python3.9/site-packages/yellowbrick/cluster/elbow.py:374: YellowbrickWarning:

No 'knee' or 'elbow' point detected, pass `locate_elbow=False` to remove the warning

No knee or elbow point is detected in this case: this already suggests that the cells of HCC1806 may not be clearly divided into clusters.

Silhouette score¶

In [ ]:
from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, 
                max_iter=300, 
                tol=1e-04, 
                init='k-means++', 
                n_init=10, 
                random_state=42, 
                algorithm='auto')
    km.fit(data_train)
    silhouette_scores.append(silhouette_score(data_train, km.labels_))

fig, ax = plt.subplots()
ax.plot(range(2, 7), silhouette_scores, color="black")
#ax.set_title('Silhouette Score Method')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette Scores')
plt.xticks(range(2, 7))
plt.tight_layout()
plt.show()

Silhouette analysis¶

In [ ]:
def silhouette_plot(X, model, ax, colors):
    y_lower = 10
    y_tick_pos_ = []
    sh_samples = silhouette_samples(X, model.labels_)
    sh_score = silhouette_score(X, model.labels_)
    
    for idx in range(model.n_clusters):
        values = sh_samples[model.labels_ == idx]
        values.sort()
        size = values.shape[0]
        y_upper = y_lower + size
        ax.fill_betweenx(np.arange(y_lower, y_upper),0,values,
                         facecolor=colors[idx],edgecolor=colors[idx]
        )
        y_tick_pos_.append(y_lower + 0.5 * size)
        y_lower = y_upper + 10

    ax.axvline(x=sh_score, color="red", linestyle="--", label="Avg Silhouette Score")
    ax.set_title("Silhouette Plot for {} clusters".format(model.n_clusters))
    l_xlim = max(-1, min(-0.1, round(min(sh_samples) - 0.1, 1)))
    u_xlim = min(1, round(max(sh_samples) + 0.1, 1))
    ax.set_xlim([l_xlim, u_xlim])
    ax.set_ylim([0, X.shape[0] + (model.n_clusters + 1) * 10])
    ax.set_xlabel("silhouette coefficient values")
    ax.set_ylabel("cluster label")
    ax.set_yticks(y_tick_pos_)
    ax.set_yticklabels(str(idx) for idx in range(model.n_clusters))
    ax.xaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.legend(loc="best")
    return ax

k_max = 7
ncols = 3
nrows = k_max // ncols + (k_max % ncols > 0)
fig = plt.figure(figsize=(15,15), dpi=200)

for k in range(2,k_max+1):
    
    km = KMeans(n_clusters=k, 
                max_iter=300, 
                tol=1e-04, 
                init='k-means++', 
                n_init=10, 
                random_state=42, 
                algorithm='auto')

    km_fit = km.fit(data_train)
    
    ax = plt.subplot(nrows, ncols, k-1)
    silhouette_plot(data_train, km_fit,ax, cluster_colors)

fig.suptitle("Silhouette plots", fontsize=18, y=1)
plt.tight_layout()
plt.show()

Contrary to MCF7, here there is no dominant cluster for any choice of k.

Let's perform the clustering.

2 clusters¶

In [ ]:
kmeans = KMeans(n_clusters=2, random_state=2352).fit(data_train)
kmeans.labels_
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0], dtype=int32)
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("2-means clustering")
plt.show()
In [ ]:
KM_plot(20, 100, kmeans)
Elevation: 20  Azimut: 100

3 clusters¶

In [ ]:
kmeans2 = KMeans(n_clusters=3, random_state=2352).fit(data_train)
kmeans2.labels_
array([0, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1,
       2, 1, 1, 0, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 0, 1, 1, 1, 1, 2, 2,
       2, 1, 1, 1, 1, 2, 2, 2, 2, 0, 1, 0, 1, 1, 2, 2, 2, 1, 1, 0, 2, 2,
       2, 2, 0, 1, 1, 0, 2, 0, 2, 1, 1, 0, 1, 2, 2, 0, 0, 0, 1, 1, 2, 0,
       2, 2, 0, 2, 0, 1, 0, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 1, 1, 2, 2,
       0, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1,
       0, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1, 0, 1, 0, 0, 0, 2, 0, 1, 1,
       1, 2, 0, 2, 2, 1, 1, 1, 0, 2, 2, 1, 0, 1, 1, 2, 2, 2, 0, 1, 1, 1,
       2, 2, 2, 1, 2, 2], dtype=int32)
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans2.labels_, cmap=ListedColormap(cluster_colors[:3]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("3-means clustering")
plt.show()
In [ ]:
KM_plot(20, 100, kmeans2)
Elevation: 20  Azimut: 100

The clustering does not give us a great result: we can clearly see that many cells are misclassified. Since the centroids are randomly initialized, we tried different seeds, but the result remains similar. Comparing the 3-means clustering with the true classification of the cells, it seems to define two clusters (out of three) that are more accurate. Now let's compute the accuracy of the clustering:

In [ ]:
og_labels = data_pr2_lab["Condition"].values
clust_predict = kmeans.labels_
clustering_accuracy(og_labels, clust_predict)
Clustering accuracy: 51.64835164835165 %

Hence, K-means clustering distinguishes cells with only 51.65% accuracy, measured as the number of correct classifications divided by the total number of samples: barely better than chance.

Agglomerative Clustering¶

In [ ]:
agglomerative = AgglomerativeClustering().fit(data_train)
agglomerative.labels_
array([1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0])
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agglomerative.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
In [ ]:
AG_plot_k(20,100,2)
In [ ]:
agg_predict = agglomerative.labels_
clustering_accuracy(og_labels, agg_predict)
Clustering accuracy: 61.53846153846154 %

The accuracy of agglomerative clustering (61.5%) is higher than that of k-means (51.6%): agglomerative clustering gives better results here.

Clustering of principal components¶

We perform only agglomerative clustering, as it seems more promising than k-means.

Agglomerative Clustering¶

In [ ]:
agg_PC2 = AgglomerativeClustering().fit(data_pr2)
agg_PC2.labels_
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 1])
In [ ]:
aggPC2_predict = agg_PC2.labels_
clustering_accuracy(og_labels, aggPC2_predict)
Clustering accuracy: 51.0989010989011 %
In [ ]:
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agg_PC2.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
In [ ]:
agg_PC3 = AgglomerativeClustering().fit(data_pr3)
agg_PC3.labels_
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0])
In [ ]:
aggPC3_predict = agg_PC3.labels_
clustering_accuracy(og_labels, aggPC3_predict)
Clustering accuracy: 57.142857142857146 %
In [ ]:
AGPC_plot_int(20,100)

In the space of the principal components, agglomerative clustering performs worse.

Overall, clustering on HCC1806 does not do a good job of dividing cells into Normoxia and Hypoxia clusters. We should try different clustering methods to get a better division; next we try clustering after UMAP dimensionality reduction.

Clustering using UMAP¶

UMAP is a nonlinear dimensionality reduction technique that aims to preserve both the local and the global structure of the data. It constructs a high-dimensional graph representation in which each data point is connected to its nearest neighbors, and then optimizes a lower-dimensional embedding so that the distances between connected points are preserved as closely as possible.

After dimensionality reduction with UMAP, we perform k-means clustering on the space of the reduced components.

In [ ]:
UMA = KMeans(n_clusters=2)
labels_UM = UMA.fit_predict(embedding)
In [ ]:
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels_UM, cmap=my_cmap)
plt.show()
In [ ]:
clustering_accuracy(og_labels, labels_UM)
Clustering accuracy: 86.81318681318682 %

The accuracy of this technique is significantly higher.

Genes¶

We perform clustering on genes, to get more insights on the matter. We do it both in full dimension, projecting the results with PCA, and in the space of the principal components. We use the same methods and techniques as before:

  • k-means clustering (establishing the number of clusters with the methods described before)
  • agglomerative clustering

Clustering in full dimensions and visualization of the results with PCA¶

K-Means¶

We start by determining the right number of clusters (with the same methods used before).

Elbow method¶

In [ ]:
fig, ax = plt.subplots()

visualizer = KElbowVisualizer(KMeans(), k=(2,7),ax=ax)
visualizer.fit(data_genes)
ax.set_xticks(range(2,7))
visualizer.show()
plt.show()

Silhouette score¶

In [ ]:
silhouette_scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, 
                max_iter=300, 
                tol=1e-04, 
                init='k-means++', 
                n_init=10, 
                random_state=42, 
                algorithm='auto')
    km.fit(data_genes)
    silhouette_scores.append(silhouette_score(data_genes, km.labels_))

fig, ax = plt.subplots()
ax.plot(range(2, 7), silhouette_scores, marker='x', color="black")
ax.set_title('Silhouette Score Method')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette Scores')
plt.xticks(range(2, 7))
plt.tight_layout()
plt.show()

In [ ]:
def silhouette_plot(X, model, ax, colors):
    y_lower = 10
    y_tick_pos_ = []
    sh_samples = silhouette_samples(X, model.labels_)
    sh_score = silhouette_score(X, model.labels_)
    
    for idx in range(model.n_clusters):
        values = sh_samples[model.labels_ == idx]
        values.sort()
        size = values.shape[0]
        y_upper = y_lower + size
        ax.fill_betweenx(np.arange(y_lower, y_upper),0,values,
                         facecolor=colors[idx],edgecolor=colors[idx]
        )
        y_tick_pos_.append(y_lower + 0.5 * size)
        y_lower = y_upper + 10

    ax.axvline(x=sh_score, color="red", linestyle="--", label="Avg Silhouette Score")
    ax.set_title("Silhouette Plot for {} clusters".format(model.n_clusters))
    l_xlim = max(-1, min(-0.1, round(min(sh_samples) - 0.1, 1)))
    u_xlim = min(1, round(max(sh_samples) + 0.1, 1))
    ax.set_xlim([l_xlim, u_xlim])
    ax.set_ylim([0, X.shape[0] + (model.n_clusters + 1) * 10])
    ax.set_xlabel("silhouette coefficient values")
    ax.set_ylabel("cluster label")
    ax.set_yticks(y_tick_pos_)
    ax.set_yticklabels(str(idx) for idx in range(model.n_clusters))
    ax.xaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.legend(loc="best")
    return ax

k_max = 7
ncols = 3
nrows = k_max // ncols + (k_max % ncols > 0)
fig = plt.figure(figsize=(15,15), dpi=200)

for k in range(2,k_max+1):
    
    km = KMeans(n_clusters=k, 
                max_iter=300, 
                tol=1e-04, 
                init='k-means++', 
                n_init=10, 
                random_state=42, 
                algorithm='auto')

    km_fit = km.fit(data_genes)
    
    ax = plt.subplot(nrows, ncols, k-1)
    silhouette_plot(data_genes, km_fit,ax, genes_colors)

fig.suptitle("Silhouette plots", fontsize=18, y=1)
plt.tight_layout()
plt.show()

These analyses suggest that the best number of clusters is either two (silhouette) or three (elbow). Moreover, from the silhouette plots we clearly see that for any choice of k there is one big, main cluster.

2 clusters¶

In [ ]:
kmeans_g2 = KMeans(n_clusters=2, random_state=1324).fit(data_genes)
kmeans_g2.labels_
array([1, 1, 1, ..., 1, 0, 0], dtype=int32)
In [ ]:
KM_plot_k_int(30, 120, 2)
In [ ]:
diagnoses(kmeans_g2, data_genes, genes_colors)

3 clusters¶

In [ ]:
kmeans_g3 = KMeans(n_clusters=3, random_state=1324).fit(data_genes)
kmeans_g3.labels_
array([2, 2, 1, ..., 1, 0, 0], dtype=int32)
In [ ]:
KM_plot_k_int(30, 120, 3)
In [ ]:
diagnoses(kmeans_g3, data_genes, genes_colors)

We also try with k=4, to see if we can spot other "classes" of genes.

4 clusters¶

In [ ]:
kmeans_g4 = KMeans(n_clusters=4, random_state=1324).fit(data_genes)
kmeans_g4.labels_
array([3, 3, 1, ..., 1, 0, 0], dtype=int32)
In [ ]:
KM_plot_k_int(30, 120, 4)
In [ ]:
diagnoses(kmeans_g4, data_genes, genes_colors)

Agglomerative clustering¶

In [ ]:
agglomerative_g2 = AgglomerativeClustering().fit(data_pr3_g)
agglomerative_g2.labels_
array([0, 0, 0, ..., 0, 1, 1])

We also perform agglomerative clustering, again using the Euclidean distance and the Ward linkage, and visualize the result with a plot.

In [ ]:
AG_g_plot(30,120)

The execution of all these tasks on HCC1806 gave us very similar results. Indeed, as noticed before, the "distribution" of genes seems to be very similar between the two cell lines. The one thing we want to point out regards the 4-means clustering, where we observe a slightly different separation that might mean something from a biological point of view.

In [ ]:
KM_plot_k_int(30,120,4)

Supervised Learning¶

We now delve into the heart of supervised machine learning methods to understand the dynamics of our gene expression data across the different sequencing techniques and cell types. The goal is to create models that are not only accurate but also offer insights into the nature of the data and the underlying biological processes.

We have four datasets at our disposal: MCF7 and HCC1806, each sequenced with both the SmartSeq and DropSeq techniques. For each of these datasets, we employed a range of supervised learning algorithms: Support Vector Machines (SVM), Random Forests, and Logistic Regression. The exception is HCC1806 - DropSeq, for which we also attempted an MLP classifier.

The choice of these algorithms was influenced by their diverse strengths. SVMs are particularly adept at handling high-dimensional data, a common characteristic of gene expression datasets. Random Forests, on the other hand, are known for their robustness to overfitting and their ability to handle nonlinear relationships. Logistic Regression, while seemingly simpler, is a highly interpretable model that can provide insights into which genes are most informative in distinguishing between cell types.

To optimize the performance of each of these models, we undertook hyperparameter tuning. This process was carried out with a focus on achieving a fine balance between computational complexity and model performance. The underlying premise was to ensure that our models are not only accurate but also efficient - a crucial aspect when dealing with large-scale gene expression data.

In order to build a more powerful classifier, we exploited the power of ensemble learning: by leveraging the strengths of multiple learning algorithms, we aimed to construct an ensemble model that offers improved predictive performance and robustness. This integrative approach often helps to achieve better performance by capturing more complex underlying structures in the data.
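
The ensemble idea can be sketched with scikit-learn's `VotingClassifier`; the base estimators, their hyperparameters and the synthetic data below are illustrative, not the tuned models used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for a (samples x genes) expression matrix
X, y = make_classification(n_samples=250, n_features=50,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("log", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_tr, y_tr)
score = ensemble.score(X_te, y_te)
print(f"ensemble accuracy: {score:.2f}")
```

Soft voting averages the predicted probabilities, which usually works better than hard majority voting when the base estimators are well calibrated.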

Given the extent of our analysis, we have decided to maintain here a focus on a single dataset, i.e. MCF7 SmartSeq - this will allow us to delve deeper into the analytical process without compromising readability. However, please note that all analyses were conducted similarly across all datasets, and we will bring in results from other datasets where they offer interesting contrasts or confirmations.

In all classifiers we follow the same steps:

  1. Tune the hyperparameters and select the best model;

  2. Make some plots: decision boundary and accuracy vs number of features;

  3. Performance on test set.

Libraries and methods¶

Here, we list the libraries and methods we have used from scikit-learn. Then we import the dataset, add the target labels, and split it into training and test sets. The test sets will be used in the evaluation section to assess the performance of the models on unseen data. For the MCF7 dataset, we observe a relatively balanced distribution across the 2 labels, so we consider accuracy an appropriate evaluation metric for our case. Accuracy is not only straightforward and easy to interpret, but also widely recognized and used in the field. This allows us to maintain clarity in our performance assessment while ensuring the results are still meaningful.

In [ ]:
#Importing the dataset and adding label
df = pd.read_csv("drive/MyDrive/Datasets/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt", sep=" ")
df = df.T
df['label'] = df.index.to_series().apply(lambda x: 'Normoxia' if 'Norm' in x else 'Hypoxia')
In [ ]:
df["label"].value_counts() #pretty balanced! accuracy is fine
Normoxia    126
Hypoxia     124
Name: label, dtype: int64
In [ ]:
#Creating X and y
X = df.drop("label", axis = 1)
y = df["label"]
In [ ]:
#Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(187, 3000) (187,) (63, 3000) (63,)

1. Logistic Regression¶

We use the standard LogisticRegression object from Scikit-learn and tune the coefficient C (the inverse of regularization strength) over a set of values. Note that for cross-validation we employed the `neg_log_loss` scoring, as logistic regression is the model that produces probabilistic outputs.

In [ ]:
log = LogisticRegression(solver='liblinear')
params_log = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

log_gs = GridSearchCV(log, params_log, cv=5, scoring=['neg_log_loss'], refit='neg_log_loss')
log_gs.fit(X_train, y_train)
In [ ]:
log_gs.best_estimator_
LogisticRegression(C=1, penalty='l1', solver='liblinear')
In [ ]:
best_log = log_gs.best_estimator_
In [ ]:
log_gs.best_params_
{'C': 10, 'penalty': 'l1'}

This will be our chosen model for Logistic Regression: an l1 penalty with the value of C selected above. As simple as it may seem, its performance is outstanding:
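As an aside on why an l1 penalty suits this high-dimensional setting: it drives many coefficients exactly to zero, effectively selecting a sparse subset of genes. A small sketch on synthetic data (not the MCF7 matrix; parameter values are illustrative):

```python
# l1 vs l2 regularization: l1 zeroes out uninformative coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=50,
                                     n_informative=5, random_state=0)
l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X_demo, y_demo)
l2 = LogisticRegression(penalty='l2', C=0.1, solver='liblinear').fit(X_demo, y_demo)
# l1 leaves far more coefficients at exactly zero than l2
print((l1.coef_ == 0).sum(), (l2.coef_ == 0).sum())
```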

In [ ]:
acc_log = cross_val_score(best_log, X_train, y_train).mean() #will be used as weight in Ensemble Classifier
In [ ]:
ypredlog = best_log.predict(X_test)
accuracy_logistic = accuracy_score(y_test, ypredlog)
accuracy_logistic
1.0

We can also explore how the accuracy of the model behaves as we change the number of features we train it with. As a reference, look at this graph:

In [ ]:
features_range = range(1, 101, 5)
scores = []

for n in features_range:
    # Select top n features
    selector = SelectKBest(mutual_info_classif, k=n)
    X_new = selector.fit_transform(X_train, y_train)

    # Train the model
    model = LogisticRegression(C=10, penalty='l1', solver='liblinear')
    score = cross_val_score(model, X_new, y_train, cv=5, scoring='accuracy').mean()
    scores.append(score)

plt.figure(figsize=(10, 6))
plt.plot(features_range, scores, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.title('Number of features vs Accuracy')
plt.grid(True)
plt.show()

In this analysis, we incrementally increased the number of features (genes) used by the model, in increments of five. These "best genes" were selected based on their mutual information with the label, following the approach we employed in earlier sections of this project.

The model's performance improves rapidly, achieving perfect accuracy once about 70 genes are incorporated. This exceptional performance may be due to the quality of our dataset: the genes retained were already curated, chosen specifically for their explanatory power in distinguishing between hypoxia and normoxia conditions.

2. Support Vector Machines¶

Our next model is SVM, and to implement it we are going to use the SVC() class from Scikit.

The procedure for this section essentially follows that of the previous one, with the noteworthy addition of a discussion of the precision-recall tradeoff. We believe this quick remark on error analysis enables a more comprehensive understanding of our model's performance.

As SVM's training cost increases dramatically with dataset size and dimensionality, we employed different strategies depending on the dataset. For the larger DropSeq datasets we opted for Randomized Search over Grid Search for efficiency. Due to significant running times, we also simplified the process by reducing the number of hyperparameters to be tuned.
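For reference, a hedged sketch of that Randomized Search setup, shown here on a small synthetic problem (the parameter ranges are illustrative, not the ones we tuned on DropSeq): instead of exhausting a grid, only `n_iter` sampled settings are evaluated, which caps the number of fits.

```python
# RandomizedSearchCV samples hyperparameter settings instead of enumerating a grid.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)
param_dist = {'C': loguniform(1e-2, 1e2), 'kernel': ['rbf', 'linear']}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=8, cv=3,
                            scoring='accuracy', random_state=0)
search.fit(X_demo, y_demo)  # 8 sampled settings x 3 folds = 24 fits total
print(search.best_params_, round(search.best_score_, 3))
```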

In [ ]:
# Define the parameter grid
param_grid = {'kernel': ['rbf', 'sigmoid', 'poly', 'linear'], 'C': [0.1, 1, 10, 100], 'gamma': [1, 10, 100], 'degree': [2, 3, 4, 5]}

# Create the SVM model
svm_model = SVC()

# Perform grid search with cross-validation
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning: 
100 fits failed out of a total of 960.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
100 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py", line 270, in fit
    raise ValueError(
ValueError: The dual coefficients or intercepts are not finite. The input data may contain large values and need to be preprocessed.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite:
 [192 cross-validation accuracies omitted: each is ≈0.508, ≈0.995, or nan for a failed fit]
  warnings.warn(
GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100], 'degree': [2, 3, 4, 5],
                         'gamma': [1, 10, 100],
                         'kernel': ['rbf', 'sigmoid', 'poly', 'linear']},
             scoring='accuracy')
In [ ]:
# Best model
best_svm = grid_search.best_estimator_

# Get the best parameter values
best_parameters = grid_search.best_params_
best_parameters
{'C': 0.1, 'degree': 2, 'gamma': 1, 'kernel': 'poly'}
In [ ]:
# Accuracy on training data
cross_val_score(best_svm, X_train, y_train, cv=5, scoring="accuracy")
array([0.97368421, 1.        , 1.        , 1.        , 1.        ])
In [ ]:
acc_svm = cross_val_score(best_svm, X_train, y_train).mean() #will be used later
In [ ]:
# Confusion matrix
predictions = cross_val_predict(best_svm, X_train, y_train, cv=3)
conf_matrix = confusion_matrix(y_train, predictions)
conf_matrix
array([[92,  0],
       [ 1, 94]])
In [ ]:
# With percentages
row_sums = conf_matrix.sum(axis=1, keepdims=True)
norm_conf_matrix = np.round(conf_matrix / row_sums, 2)
norm_conf_matrix
array([[1.  , 0.  ],
       [0.01, 0.99]])
In [ ]:
#Precision and recall
print("Precision score =",conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[0, 1]))
print("Recall score =",conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 0]))
Precision score = 1.0
Recall score = 0.9894736842105263

Comment on precision and recall

In the context of our analysis, cells subjected to hypoxia emerge as potential indicators of a malignant tumour. The paramount objective, therefore, would be to accurately flag these cells, considering their vital role in the onset of cancer.

Consequently, one strategy could involve orienting our classifier to prioritize the identification of Hypoxia cells, even at the risk of occasional misclassifications (such as falsely labelling Normoxia cells as Hypoxia, the so-called false positives). This calls for a classifier with high recall on the Hypoxia class, even at the cost of modest precision.

However, this approach deviates from our initial assignment. Our principal task is to distinguish between the two cellular conditions, Hypoxia and Normoxia, without overemphasizing either. Our mission remains unbiased discernment rather than prioritized detection, though this trade-off is something a clinician may want to consider.
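To make the precision-recall remark concrete, the two scores above can be recomputed by hand from the cross-validated confusion matrix (a minimal pure-Python sketch; note that scikit-learn's confusion_matrix sorts string labels alphabetically, so row/column 1 corresponds to Normoxia here, and one would swap indices to score the Hypoxia class instead):

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted class.
conf_matrix = [[92, 0],
               [1, 94]]

# Index 1 taken as the positive class, matching the computation above.
tp = conf_matrix[1][1]  # positives correctly flagged
fp = conf_matrix[0][1]  # negatives mislabelled as positive
fn = conf_matrix[1][0]  # positives missed

precision = tp / (tp + fp)  # 1.0
recall = tp / (tp + fn)     # 94/95 ≈ 0.9895
```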

Decision Boundary

In [ ]:
#Splitting dataset
df_train, df_test = train_test_split(df, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
df_train = df_train.transpose()
(187, 3000) (187,) (63, 3000) (63,)
In [ ]:
def find_word(string, word1, word2):
    """Return word1 if it occurs in string (comparing letters only), else word2."""
    string = ''.join(filter(str.isalpha, string))
    word1 = ''.join(filter(str.isalpha, word1))
    return word1 if word1 in string else word2

def remove_double_quotes(word):
    return word.replace('"', '')
In [ ]:
df_train = df_train.rename(columns={i: remove_double_quotes(i) for i in df_train.columns})
df_train = df_train.rename(columns={i: find_word(i, "Norm", "Hypo") for i in df_train.columns})
df_train = df_train.transpose()
df_train = df_train.drop(columns=['label'])
df_train
"CYP1B1" "CYP1B1-AS1" "CYP1A1" "NDRG1" "DDIT4" "PFKFB3" "HK2" "AREG" "MYBL2" "ADM" ... "CD27-AS1" "DNAI7" "MAFG" "LZTR1" "BCO2" "GRIK5" "SLC25A27" "DENND5A" "CDK5R1" "FAM13A-AS1"
Hypo 14546 5799 6817 338 3631 460 1259 0 76 0 ... 0 0 0 0 0 0 0 0 0 0
Hypo 6734 2631 226 1203 6612 3025 961 142 32 838 ... 20 0 54 33 0 0 0 109 0 0
Hypo 4099 1583 0 401 1877 1691 274 1220 300 234 ... 0 0 26 151 0 0 0 58 0 0
Norm 196 102 1 243 266 278 78 1 199 0 ... 79 0 1 0 0 0 0 45 19 0
Hypo 4596 1689 5136 1496 4329 3666 3566 77 173 124 ... 0 0 0 0 0 0 0 39 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Hypo 29803 12073 8024 1414 7148 4941 2937 468 293 486 ... 0 0 3 0 0 0 0 129 0 0
Hypo 1338 554 14 634 3513 1360 303 558 178 994 ... 0 0 46 5 0 0 0 14 0 0
Hypo 12647 5175 61 608 4343 1175 1410 39 1 1946 ... 24 0 17 0 0 0 0 101 0 22
Hypo 5954 2311 0 3884 12034 5986 5103 0 0 1242 ... 0 0 235 0 0 0 0 10 0 21
Norm 0 0 0 0 196 3 0 1 461 0 ... 0 0 62 0 0 0 0 21 0 0

187 rows × 3000 columns

In [ ]:
data_train_transpose = pd.DataFrame.transpose(df, copy=True)
data_train_transpose_lab = data_train_transpose.copy() #with labels
data_train_transpose_lab['label'] = data_train_transpose.index.to_series().apply(lambda x: 'Norm' if 'Norm' in x else 'Hypo')
# features = data_train_transpose.columns
In [ ]:
PCA2_data = PCA(n_components=2)
principalComponents_hcc2 = PCA2_data.fit_transform(df_train)
data_pr2 = pd.DataFrame(data = principalComponents_hcc2
             , columns = ['PC1', 'PC2'])
print('Explained variation per principal component: {}'.format(PCA2_data.explained_variance_ratio_)) 
Explained variation per principal component: [0.6446813  0.08999785]
In [ ]:
data_pr2_lab = data_pr2.copy()
# 0 = Normoxia, 1 = Hypoxia; a vectorised assignment avoids the SettingWithCopyWarning
data_pr2_lab["Condition"] = [1 if idx == "Hypo" else 0 for idx in df_train.index]
data_pr2_lab_copy = data_pr2_lab.copy()
data_pr2_lab_copy.drop('Condition', axis=1, inplace=True)
In [ ]:
kernels = ['linear', 'rbf', 'sigmoid', 'poly']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))

for idx, kernel in enumerate(kernels):
    svm_model = SVC(kernel=kernel, C=0.1)
    # fit on plain arrays so predicting on the mesh grid raises no feature-name warning
    svm_model.fit(data_pr2_lab_copy.values, data_pr2_lab["Condition"])

    ax = axes[idx // 2][idx % 2]

    # Scatter plot of the data points (red points are cells in Normoxia, green ones are in Hypoxia)
    ax.scatter(data_pr2_lab["PC1"], data_pr2_lab["PC2"], c=data_pr2_lab["Condition"], cmap="prism")
    ax.set_title(kernel)
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')

    # Create a mesh grid of points
    x_min, x_max = data_pr2_lab.iloc[:, 0].min() - 1, data_pr2_lab.iloc[:, 0].max() + 1
    y_min, y_max = data_pr2_lab.iloc[:, 1].min() - 1, data_pr2_lab.iloc[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 100), np.arange(y_min, y_max, 100))

    # Obtain predicted class labels for each point in the mesh grid
    Z = svm_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the decision boundary and the margin
    ax.contour(xx, yy, Z, colors='b', linewidths=0.5)

plt.suptitle('Decision Boundaries for different Kernels', fontsize=16)

plt.tight_layout()
plt.show()

To analyze the distinct properties of the various kernels, we illustrate the decision boundaries produced by each model. Since a visual representation requires two dimensions, we reduced our 3000-dimensional dataset to its two most informative axes via Principal Component Analysis (PCA), selecting the two principal components that account for the largest share of the dataset's variance. The plots therefore visualize the decision boundaries in a way that encapsulates the critical features of our high-dimensional dataset.
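As a side note on where the explained-variance figures come from, PCA's explained_variance_ratio_ equals the normalized squared singular values of the centered data matrix. A minimal NumPy sketch on toy data (not the notebook's matrix):

```python
import numpy as np

# Toy 4x2 data: the first axis carries 4x the variance of the second.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
Xc = X - X.mean(axis=0)                 # center each feature
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / (s**2).sum()         # ≈ [0.8, 0.2]
print(var_ratio)
```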

Testing the number of features

In [ ]:
features_range = range(1, 101, 5)
scores = []

for n in features_range:
    # Select top n features
    selector = SelectKBest(mutual_info_classif, k=n)
    X_new = selector.fit_transform(X_train, y_train)

    # Train the model
    model = best_svm
    score = cross_val_score(model, X_new, y_train, cv=5, scoring='accuracy').mean()
    scores.append(score)

plt.figure(figsize=(10, 6))
plt.plot(features_range, scores, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.title('Number of features vs Accuracy')
plt.grid(True)
plt.show()

Accuracy on test set

In [ ]:
# Test on test
best_svm.fit(X_train, y_train)
test_accuracy = best_svm.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

3. Random Forest¶

We now move to Random Forest, which is itself an ensemble learning model, as it bases its predictions on the collection of Decision Trees it builds.

In [ ]:
rf = RandomForestClassifier(random_state=42)
params_rf = {"n_estimators": [25, 50, 100, 200, 300], "max_leaf_nodes" : np.arange(20, 100, 10)}
rf_gs = GridSearchCV(rf, params_rf, cv=5)

rf_gs.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_leaf_nodes': array([20, 30, 40, 50, 60, 70, 80, 90]),
                         'n_estimators': [25, 50, 100, 200, 300]})
In [ ]:
rf_gs.best_estimator_
RandomForestClassifier(max_leaf_nodes=20, n_estimators=25, random_state=42)
In [ ]:
rf_gs.best_params_
{'max_leaf_nodes': 20, 'n_estimators': 25}
In [ ]:
best_rf = rf_gs.best_estimator_
In [ ]:
acc_rf = cross_val_score(best_rf, X_train, y_train).mean() #will be used later
acc_rf
1.0

Again we analyze the number of features against the performance of the model, and the scores are quite impressive:

In [ ]:
scores = []

for n in features_range:
    # Select top n features
    selector = SelectKBest(mutual_info_classif, k=n)
    X_new = selector.fit_transform(X_train, y_train)

    # Train the model
    model = RandomForestClassifier(max_leaf_nodes = 20, n_estimators = 25, random_state = 42)
    score = cross_val_score(model, X_new, y_train, cv=5, scoring='accuracy').mean()
    scores.append(score)

plt.figure(figsize=(10, 6))
plt.plot(features_range, scores, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.title('Number of features vs Accuracy')
plt.grid(True)
plt.show()

To investigate which features (genes, in our case) matter most, we employ the feature_importances_ attribute of the trained Random Forest model. This attribute computes the mean decrease in impurity observed when splitting the data on a particular feature, averaged over all trees in the forest.

These genes stand out because their expression levels (either higher or lower than certain thresholds) provide pivotal information for the model to distinguish between cells that have been exposed to hypoxic versus normoxic conditions. Their higher importance scores indicate that changes in these genes' expression levels have a profound effect on the cell's response to oxygen levels, making them key players in our classification task.
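Impurity-based importances can be biased toward features with many possible split points, so a common cross-check is permutation importance, which measures the score drop when a single feature is shuffled. A minimal sketch on synthetic data (not our expression matrix; sizes and seeds are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 2] > 0).astype(int)   # only feature 2 carries signal

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
result = permutation_importance(rf_demo, X_demo, y_demo, n_repeats=10, random_state=42)
print(result.importances_mean.argmax())   # feature 2 dominates
```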

In [ ]:
feature_importances = rf_gs.best_estimator_.feature_importances_
features = X.columns

# create DataFrame to hold the feature names and their corresponding importance scores
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
})

feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

print(feature_importance_df)
           Feature  Importance
103       "MT-CYB"    0.071049
477      "FAM162A"    0.068452
22         "BNIP3"    0.061850
869       "ARPC1B"    0.059913
1589        "DOLK"    0.052570
...            ...         ...
1023       "PYCR3"    0.000000
1024       "KANK3"    0.000000
1025       "KRT83"    0.000000
1026      "ZNF592"    0.000000
2999  "FAM13A-AS1"    0.000000

[3000 rows x 2 columns]

A few comments are noteworthy at this point:

  • MT-CYB (Cytochrome B): this gene is part of the mitochondrial DNA and it codes for a component of the electron transport chain, which is crucial for cellular respiration. Mutations in this gene have been associated with various diseases, including some forms of cancer [1].
  • BNIP3: this is a gene known for its role in regulating cell death and survival. It's implicated in hypoxia-induced cell death and its dysregulation has been associated with various types of cancer[2].
  • ARPC1B: This gene is a part of the ARP2/3 complex involved in the regulation of actin polymerization. It is essential for cell motility and integrity of the cytoskeleton. Even though a definite relationship between this specific gene and cancer has not yet been established, dysregulation of genes involved in cell motility can contribute to metastasis in cancer[3].

[1]: https://pubmed.ncbi.nlm.nih.gov/18245469/

[2]: https://pubmed.ncbi.nlm.nih.gov/16357180/

[3]: https://www.nature.com/articles/nrc.2018.15

When applying the same methodology to the other datasets, such as HCC1806, we identify another set of significant genes: NDRG1, well known to be involved in stress responses, cell growth, and differentiation; it has been identified as a potential tumor suppressor gene and is often downregulated in several types of cancer[4]. DDIT4, known to regulate the cell's response to stress, is often upregulated in response to hypoxia[5].

[4]: https://pubmed.ncbi.nlm.nih.gov/17316623/#:~:text=NDRG1%20is%20a%20hypoxia%2Dinducible,human%20hepatocellular%20carcinoma%20(HCC)

[5]: https://www.nature.com/articles/s41416-018-0368-3

Ensembling¶

We now move on to developing an ensemble learning framework. To prioritize the models that performed better, we weight the voting procedure according to the mean accuracy exhibited by each model on the validation sets. This approach gives the more accurate models a higher influence within the ensemble.
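The weighted hard vote that VotingClassifier performs can be sketched in a few lines of pure Python (an illustrative helper, not Scikit-learn's implementation):

```python
from collections import defaultdict

def weighted_vote(labels, weights):
    """Return the label whose votes carry the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Two agreeing models outvote one dissenting model:
print(weighted_vote(['Hypoxia', 'Normoxia', 'Hypoxia'], [1.0, 0.968, 1.0]))  # Hypoxia
```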

In [ ]:
best_models = [('log', best_log), ('svm', best_svm), ('rf', best_rf)]
accuracies = [acc_log, acc_svm, acc_rf]
ensemble = VotingClassifier(best_models, weights=accuracies)
ensemble.fit(X_train, y_train)
VotingClassifier(estimators=[('log',
                              LogisticRegression(C=1, penalty='l1',
                                                 solver='liblinear')),
                             ('svm',
                              SVC(C=0.1, degree=2, gamma=1, kernel='poly')),
                             ('rf',
                              RandomForestClassifier(max_leaf_nodes=20,
                                                     n_estimators=25,
                                                     random_state=42))],
                 weights=[1.0, 0.9682539682539683, 1.0])
In [ ]:
predictions = ensemble.predict(X_test)
accuracy_score(y_test, predictions)
1.0

HCC1806 DropSeq Neural Network¶

We decided to devote a separate section of our project to the HCC1806 DropSeq dataset, the most challenging dataset we had to deal with. Its dimensionality (14682x3000) emerged as a substantial obstacle during model training. For the first time, we were confronted with the difficult decision of trading off accuracy for computational efficiency. Despite enduring lengthy waits for hyperparameter optimization, sometimes stretching into hours, we strived for even slight improvements in accuracy: starting from 90% accuracy with Random Forest, we managed to achieve 94% by finding the optimal combination of hyperparameters, although it took several hours (and a cool temperature in the room!). Ultimately, our search led us to a more intricate model capable of capturing relationships that eluded our previous models. This was the motivation behind implementing a small Neural Network for this particular dataset.

In [ ]:
hcc = pd.read_csv("drive/MyDrive/Datasets/HCC1806_Filtered_Normalised_3000_Data_train.txt", sep=" ")
hcc = hcc.T
hcc['label'] = hcc.index.to_series().apply(lambda x: 'Normoxia' if 'Norm' in x else 'Hypoxia')
In [ ]:
#Creating X and y
X = hcc.drop("label", axis = 1)
y = hcc["label"]
In [ ]:
#Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(11011, 3000) (11011,) (3671, 3000) (3671,)
In [ ]:
nn = MLPClassifier(random_state=42, batch_size='auto', max_iter=1000000, solver='sgd')
In [ ]:
# alert: this cell will run for approximately 25 minutes
params_grid = {'hidden_layer_sizes': [(50,), (100,), (50,50), (100,100)], 'learning_rate_init': [0.1, 0.01, 0.001]}
nn_gs = GridSearchCV(nn, params_grid, cv=3, verbose=2)
nn_gs.fit(X_train, y_train)
In [ ]:
# chosen MLP:
nn_best = MLPClassifier(random_state=42, batch_size='auto', max_iter=1000000, solver='sgd', hidden_layer_sizes=(100,), learning_rate_init=0.1)
nn_best.fit(X_train, y_train)
MLPClassifier(learning_rate_init=0.1, max_iter=1000000, random_state=42,
              solver='sgd')
In [ ]:
ypred = nn_best.predict(X_test)
accuracy_score(y_test, ypred)
0.9591391991283029

We are quite content with the outcome as we've achieved nearly 96% accuracy by merely adjusting a few hyperparameters of the Neural Network, a process that took about 20 minutes. In contrast, a similar 94% accuracy level was achieved previously, but it required several hours of waiting. This performance not only highlights the efficiency of the Neural Network model but also its efficacy in this specific application.

This model was used, together with the other models we trained on HCC1806, in our attempt to correctly predict the anonymous dataset.

Conclusion¶

Our group project has led us through a rigorous exploration of gene expression data, with the ultimate objective of distinguishing between hypoxic and normoxic conditions within single cells. This effort encompassed a wide range of techniques and methodologies, from general EDA to principal component analysis, from clustering to predictive models such as logistic regression, support vector machines, random forests, neural networks, and finally ensemble learning. We tried to make judicious decisions along the way, such as adopting randomized search over grid search for large datasets to optimize computational efficiency and time. We also highlighted the trade-offs between precision and recall.

An intriguing facet of our project was the extraction of feature importance, enabling us to identify genes that play a pivotal role in hypoxic conditions. This not only offers interesting insights into the underlying biological processes but also holds potential for further research. The ensemble learning approach integrated the strengths of various classifiers and reinforced prediction accuracy, with each model's vote weighted according to its performance. This strategy lent our model robustness, enhancing our confidence in its predictive power.

Through this project, we have not only tested our data analysis and machine learning skills but also gained insights into the intricate world of genetics and cancer biology. We hope our findings contribute to the larger conversation on cell conditions and cancer research.